11  ggplot2: the appearance and layout of a plot

In the previous chapter, Chapter 10, we covered the various geometries. We now turn to the appearance and layout of the plot. The appearance refers to the scales (including guides), facets and coordinate systems of the grammar of graphics (Chapter 9). The layout refers to the non-data parts of the plot.

Scales are important in data visualization because they determine the visual appearance of your plot. In doing so, they should help the reader interpret your visualization. You control them through colors and shapes, through transformations of the variables on the x- or y-axis that determine how you see the data, or by selecting the range of these axes. In addition, you can specify labels or breaks for the axes, or titles and labels for the guides. Because scales can also be abused to mislead people, it is important to be aware of the possibilities they offer and of how they can inform or misinform readers. Several guides can help you choose axes, grids, … . For instance, the UK’s Office for National Statistics has a full guide on [axes](https://service-manual.ons.gov.uk/data-visualisation/guidance/axes-and-gridlines) and [colors](https://service-manual.ons.gov.uk/data-visualisation/colours/using-colours-in-charts).

Facets split the plot into subplots, for instance to show the relation between two variables per value of a third variable. We will discuss coordinate systems, but only briefly. The non-data parts end this chapter.

Scales including guides, facets, coordinate systems and themes each add a layer to the plot.

Before we start, we need to load some packages

and import some data:

life_df <- readr::read_csv(here::here("data", "raw", "life_df.csv"), show_col_types = FALSE)
data_beveridge <- read_csv(here::here("data", "raw", "data_beveridge.csv")) 
Rows: 76 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): TIME PERIOD, FREQUENCY
dbl  (2): vacancy_rate, unemployment_rate
date (1): DATE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data_beveridge <- data_beveridge |> rename(date = DATE, quarter = `TIME PERIOD`)

11.1 Scales

In the previous chapter, we focused on the geometries and hardly touched upon the scales of e.g. the x- and y-axis or the color, size or fill aesthetics. In this section we will take a closer look at this component of the grammar of graphics. If you looked at the code for the graph with life expectancy, you saw some lines such as

scale_x_continuous(
 transform = "log",
 breaks = c(100, 1000, 10000, 100000),
 labels = scales::label_currency(prefix = "$")) +

or

scale_colour_paletteer_c("grDevices::Purple-Green")

These are two lines that change the scale of an aesthetic. The first changes the x-axis and the second changes the colors used to show different values of life expectancy. In general, the scale functions have a similar format: scale_AES_continuous/discrete. The word ‘scale’ is almost always part of the function name. It refers to the scales component of the grammar of graphics. The AES part refers to the aesthetic: x, y, color, fill, linewidth, alpha, size, shape. Scales can be continuous or discrete. Not all aesthetics are suitable for continuous variables. For instance, there are only as many shapes as R makes available. In addition, for some scales, there is also a manual variant. There are also some specific scales, e.g. for date/time variables, and scales to develop specific color palettes. Here, we will start with the continuous scales, then cover the discrete scales and end with the manual scales. In addition to the {ggplot2} scale functions, the {scales} package adds further possibilities. When we use that package, we’ll do so using the scales:: prefix. In that way, it is clear which part of the code is {ggplot2} and which part is {scales}.

Some scale functions work in the same way: scale_x_continuous() and scale_y_continuous(), or scale_x_discrete() and scale_y_discrete(), take the same arguments. The same holds for e.g. scale_color_manual(), scale_fill_manual(), scale_shape_manual(), scale_size_manual(), scale_linewidth_manual() and scale_linetype_manual(). All these manual scales allow you to set the values for the aesthetic. They only differ in the way you identify a value of that aesthetic: a shape number, a color or a size. In other words, although there are many scale functions, a lot of them share many similarities. You have so many of them because, for instance, each continuous numeric scale has an x and a y variant, and each color scale has a discrete, a continuous and a manual variant.
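The naming pattern can be checked against the functions that {ggplot2} exports. The snippet below is a small sketch and not part of the plots in this chapter; the object names scale_funs, manual_funs and has_values are only illustrative.

```r
# List every exported ggplot2 function whose name starts with "scale_"
scale_funs <- sort(grep("^scale_", getNamespaceExports("ggplot2"), value = TRUE))

# Position scales for the x-axis: scale_x_continuous, scale_x_discrete, ...
grep("^scale_x_", scale_funs, value = TRUE)

# Manual variants: check whether each of them has a `values` argument
manual_funs <- grep("_manual$", scale_funs, value = TRUE)
has_values <- vapply(
  manual_funs,
  function(f) "values" %in% names(formals(getExportedValue("ggplot2", f))),
  logical(1)
)
has_values
```

The output depends on the installed version of {ggplot2}, so the exact list of functions may differ on your machine.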

11.1.1 Continuous scales

We first start with numeric scales. Usually, these are for the x- and y-axis. The scales for the horizontal and vertical axis are position scales: they show how an individual value is mapped on the axis. Other aesthetics that can take continuous values are e.g. color, fill or size. For color or fill, you have diverging scales, where colors change from e.g. green to red or from blue to yellow, and sequential scales (various shades of e.g. green). We start with the continuous position scales and then move on to the other scales. We’ll discuss the first scale in detail. For other scales, we’ll only highlight arguments that are specific to these scales.

11.1.1.1 Position scales: scale_x_continuous() and scale_y_continuous()

Position scales show the position of a value on the horizontal and vertical axis. In other words, they determine where you have to look to find the values and how these values will be shown. In {ggplot2}, scale_x_continuous() and scale_y_continuous() are the functions that set the characteristics of the scales of the x- and y-axis. The reference to continuous implies that these scales are used for continuous data. Because both functions take the same arguments, we will discuss only one, scale_x_continuous(). The arguments of this function, i.e. the scale parameters that you can adjust, are

scale_x_continuous(
  name = waiver(),
  breaks = waiver(),
  minor_breaks = waiver(),
  n.breaks = NULL,
  labels = waiver(),
  limits = NULL,
  expand = waiver(),
  oob = censor,
  na.value = NA_real_,
  transform = "identity",
  trans = deprecated(),
  guide = waiver(),
  position = "bottom",
  sec.axis = waiver()
)

We’ll use Figure 11.1 to illustrate these arguments.

Figure 11.1: Scales

11.1.1.1.1 The title of the axis

By default, the name of the scale is taken from the aesthetic. For scale_x_continuous() that is the variable mapped on the x-axis. If you add name = your_name, this name will be shown as the title on the axis. You can also set these names using the labs() function. Using labs(x = "name of x"), for instance, will use “name of x” as the title for the x-axis. If you set name = NULL in scale_x_continuous(), the title will be dropped, even if you specify a name in labs(x = "name of x"). You can also drop these titles via labs(): labs(x = "") shows an empty title, while labs(x = NULL) removes the title altogether. In Figure 11.1, setting name = NULL in scale_x_continuous() means that “GDP per capita (USD, log scale)” would not be shown on the plot. For scale_y_continuous(), setting name = NULL would drop the title (the name of the variable or the name given using labs()) from the y-axis.

We’ll meet this argument in other scale_*_*() functions as well. For most of these, the variables mapped on the aesthetic (e.g. color, fill or size) are shown in the legend. Setting name = "name of aesthetic" adds “name of aesthetic” as the title of the legend. Using name = NULL drops the title from the legend, but not the legend itself. In Figure 11.1, the names for the color and size aesthetics are “Region” and “Population”. Here too, you can set those using the labs() function, e.g. labs(color = "Region", size = "Population") adds both titles to the legend. Using labs(color = "", size = "") would remove these titles.
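To make the options concrete, here is a minimal sketch using the built-in mtcars data instead of the data behind Figure 11.1. The titles “Weight (1000 lbs)” and “Cylinders” are illustrative choices.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()

# Axis title set through the scale
p_scale <- p + scale_x_continuous(name = "Weight (1000 lbs)")

# Same axis title set through labs(), together with the legend title
p_labs <- p + labs(x = "Weight (1000 lbs)", color = "Cylinders")

# name = NULL drops the axis title; the legend title is dropped in the same way
p_none <- p +
  scale_x_continuous(name = NULL) +
  scale_color_discrete(name = NULL)
```

Printing p_scale, p_labs and p_none one after the other shows the difference between the three approaches.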

11.1.1.1.2 Breaks and labels

The breaks and minor_breaks arguments set the major and minor breaks for the axis. In Figure 11.1, the major breaks are 1000, 10000 and 100000 on the x-axis and 50, 60, 70 and 80 on the y-axis. If you specify in the theme() function (discussed later in this chapter) that R needs to add major grid lines, R will draw them at the major breaks and will draw minor grid lines at the minor breaks. In Figure 11.1, the major grid lines are shown as dashed grey lines and the minor grid lines as dotted grey lines. The default value for both arguments is waiver(), which lets R determine the breaks from the data. By default, R adds one minor break between each pair of major breaks. Changing waiver() into NULL removes all major and/or minor breaks. A third option is to set your own breaks, using a vector such as c(1000, 10000, 100000) or seq(from = x, to = y, by = z), or a function that takes the limits of the scale and returns the breaks. The {scales} package includes break functions for the most common cases:

  • scales::breaks_extended(n = ...) with n the desired number of major breaks,
  • scales::breaks_width(width, offset = 0) with width the distance between major breaks and offset to shift the breaks, e.g. if you don’t want them to start at 0 but at -2 (offset = -2),
  • scales::breaks_log(n = 5, base = 10) to set nice breaks for a log axis (as integer powers of base).

For minor breaks, {scales} includes minor_breaks_width() and minor_breaks_n(). Using scale_x_continuous(), you can also set the number of breaks using n.breaks =. However, this is only possible if breaks equals its default, waiver().
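Each of these helpers returns a function, and {ggplot2} calls that function with the limits of the axis. As a sketch, we can call them by hand to see which breaks they produce. The limits used here (0 to 3, 0 to 18000 and 1 to 100000) are illustrative.

```r
# Breaks every 0.5 between 0 and 3
by_width <- scales::breaks_width(0.5)(c(0, 3))
by_width
# 0.0 0.5 1.0 1.5 2.0 2.5 3.0

# About five "nice" breaks between 0 and 18000
about_five <- scales::breaks_extended(n = 5)(c(0, 18000))
about_five

# Breaks for a log axis between 1 and 100000
log_breaks <- scales::breaks_log(n = 5, base = 10)(c(1, 1e5))
log_breaks
```

Note that n is a target, not a guarantee: the helpers prefer round values over hitting the exact number of breaks.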

You can add labels to major breaks. In Figure 11.1, these labels are “50”, “60”, “70” and “80” on the vertical axis and “$1,000.00”, “$10,000.00” and “$100,000.00” on the horizontal axis. By default, ggplot() derives the labels from the breaks. With labels = NULL, you remove the labels. A third option is to supply a vector with labels. As there is one label per major break, this vector should have the same length as the number of breaks. You can also build the labels with a function, e.g. paste0(), paste(), … . Here the function needs to output the same number of labels as there are major breaks. For instance, using

paste("$", seq(from = 10000, to = 70000, by = 10000), sep = "")
[1] "$10000" "$20000" "$30000" "$40000" "$50000" "$60000" "$70000"

for labels and

seq(from = 10000, to = 70000, by = 10000)
[1] 10000 20000 30000 40000 50000 60000 70000

for breaks will set major breaks at 10000, 20000, … 70000 and add labels $10000, $20000, … $70000.

As with breaks, {scales} includes a number of ways to deal with labels. This package includes, e.g. 

  • scales::label_number() to show numbers as numbers and not in scientific format. You can define the rounding (accuracy), a scale factor by which values are multiplied before the label is created (e.g. to show the population of a country in millions, scale = 1/1,000,000 divides all values by 1,000,000), a prefix or suffix, a decimal and thousands mark, and the style of positive and negative values (style_positive: nothing, a plus sign or a figure space; style_negative: a hyphen, a Unicode minus sign or parentheses),
  • scales::label_currency() to generate labels that include a currency prefix or suffix, with the option to set the decimal and thousands mark and a scale factor by which values are multiplied before the label is created (e.g. with scale = 1/1000, values are shown in thousands of the currency: the value €25.000 is shown as €25),
  • scales::label_percent() with an optional scale factor, which is 100 by default (a value of 0.01 in the dataset is multiplied by 100 and labelled 1%), a prefix or suffix and a decimal and thousands mark,
  • scales::label_scientific() to show labels using scientific notation (e.g. 1e+03, 5e-04),
  • scales::label_ordinal() to format numbers as ordinals (e.g. 1st, 2nd, 3rd) and optionally add a prefix or suffix, with rules defined by default by ordinal_english(); ordinal_french(gender = c("masculin", "feminin"), plural = FALSE) and ordinal_spanish() are also available.
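Like the break helpers, the label helpers return functions that {ggplot2} applies to the breaks. The sketch below calls a few of them directly on illustrative values. The accuracy argument is set explicitly so that the rounding does not depend on the input.

```r
# Population in millions: multiply by 1e-6 and add the suffix "M"
lab_millions <- scales::label_number(scale = 1e-6, accuracy = 0.1, suffix = "M")(c(1e6, 2.5e6))
lab_millions
# "1.0M" "2.5M"

# Thousands of euro: 25000 is shown as 25
lab_euro <- scales::label_currency(prefix = "€", scale = 1 / 1000, accuracy = 1)(25000)
lab_euro
# "€25"

# Proportions shown as percentages
lab_pct <- scales::label_percent(accuracy = 1)(c(0.01, 0.25))
lab_pct
# "1%" "25%"

# Ordinal numbers using the default English rules
lab_ord <- scales::label_ordinal()(1:3)
lab_ord
# "1st" "2nd" "3rd"
```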

Before we continue with the other options, let’s see what these options do. To illustrate, we’ll use the diamonds dataset, keep the observations with carat <= 3 and take a sample of 20% of these observations. We map price on the vertical axis and carat on the horizontal axis and set the color of the dots to lightsteelblue. The base graph looks like:

pl_scale_base <- diamonds |> filter(carat <= 3) |>  slice_sample(prop = 0.20) |>
  ggplot(aes(x = carat, y = price)) +
  geom_point(color = "lightsteelblue") +
  theme_minimal()
pl_scale_base

We will use both scale_x_continuous() and scale_y_continuous() to change the plot. Doing so, we can show multiple aspects at the same time. For the horizontal axis, let’s set the breaks using a function: seq() generates breaks that start at 0 and end at 3 in steps of 0.5. For the vertical axis, we set the breaks using a vector with three values: 0, 7500 and 15000. We will also change the name of the vertical axis and add a reference to USD in the title.

pl_scale_base +
  scale_x_continuous(breaks = seq(from = 0, to = 3, by  = 0.5)) +
  scale_y_continuous(
    name = "price, in USD",
    breaks = c(0, 7500, 15000))

Notice how ggplot() also adjusted the minor breaks and added one in the middle between each pair of major breaks. Let’s remove the minor breaks from the horizontal axis and add minor breaks every 1000 USD on the vertical axis, but only between 5000 and 15000. Starting from the base plot:

pl_scale_base +
  scale_x_continuous(minor_breaks = NULL) +
  scale_y_continuous(minor_breaks = seq(5000, 15000, by = 1000))

Using {scales} we can set the width of the breaks on the horizontal axis equal to 0.25 and set the number of breaks equal to 10 on the vertical axis:

pl_scale_base +
  scale_x_continuous(breaks = scales::breaks_width(0.25)) +
  scale_y_continuous(breaks = scales::breaks_extended(n = 10))

To keep the default number of major breaks for the x-axis but change the number of minor breaks to 15:

pl_scale_base +
  scale_x_continuous(minor_breaks = scales::minor_breaks_n(n = 15))

Let’s now turn to the labels and add “$” to the labels on the vertical axis and “ct” to the labels on the horizontal axis. To do so, we will also set the breaks, to ensure that the number of labels equals the number of breaks:

pl_scale_base +
  scale_x_continuous(
    breaks = seq(0, 3, by = 1), 
    labels = paste(seq(0, 3, by = 1), "ct", sep = "")) +
  scale_y_continuous(
    breaks = c(0, 5000, 10000, 15000), 
    labels = paste0("$", c(0, 5000, 10000, 15000)))

Let’s now use {scales} to show the dollar prices in euro, using the exchange rate 1 EUR = 1.10 USD and adding a euro sign, and to show carat in mg (1 ct = 200 mg) with the suffix mg. We’ll use a space as the thousands separator and a dot as the decimal mark on the vertical axis:

pl_scale_base +
  scale_x_continuous(
    labels = scales::label_number(
      scale = 200,
      suffix = "mg")) +
  scale_y_continuous(
    labels = scales::label_currency(
      scale = 1/1.10,
      prefix = "€",
      big.mark = " ",
      decimal.mark = "."))

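Note that R does not adjust the position of the breaks on the vertical axis: the breaks stay at round dollar values, so the euro labels are no longer round numbers. To fix this, we need to adjust the breaks as well. Because we divide the labels by 1.10, we multiply the breaks by 1.10: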

pl_scale_base +
  scale_x_continuous(
    labels = scales::label_number(
      scale = 200,
      suffix = "mg")) +
  scale_y_continuous(
    breaks = seq(0, 15000 * 1.10, by = 5000 * 1.10), 
    labels = scales::label_currency(
      scale = 1/1.10,
      prefix = "€",
      big.mark = " ",
      decimal.mark = "."))
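The arithmetic behind this adjustment can be checked outside the plot. The sketch below recomputes the breaks and applies the labeller by hand. The name usd_per_eur is illustrative, and accuracy = 1 is an added assumption that rounds the labels to whole euros, so that small floating-point errors in the division do not show up as cents.

```r
usd_per_eur <- 1.10  # exchange rate assumed in the text

# Breaks live on the data scale (USD); the labeller converts them to euro
usd_breaks <- seq(0, 15000 * usd_per_eur, by = 5000 * usd_per_eur)
usd_breaks
# 0  5500 11000 16500

to_euro <- scales::label_currency(
  scale = 1 / usd_per_eur,
  prefix = "€",
  big.mark = " ",
  accuracy = 1  # round to whole euros
)
euro_labels <- to_euro(usd_breaks)
euro_labels
# "€0" "€5 000" "€10 000" "€15 000"
```

A break at 5500 USD is thus labelled €5 000: the position on the axis is still in dollars, only the label is in euro.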

11.1.1.1.3 Setting ranges

expand = waiver() allows you to add some space between the data and the axis. By default, R expands the scale by 5% on each side for continuous variables and by 0.6 units on each side for discrete variables. Using expansion(mult = 0, add = 0) you can change these values. If you supply a vector with two values for mult, R expands the lower limit by the first value and the upper limit by the second value, each as a fraction of the range. If you supply one value, R uses that value on both sides. Similarly, for add, a vector with two values adds space in units of the variable to the lower and upper limit, while one value adds the same space on both sides. To illustrate, let’s add 1 unit of space on the lower end of the horizontal axis and 2 on the upper end. Here, one unit is 1 carat. We’ll expand the vertical axis by 15% on the lower end and 25% on the upper end:

pl_scale_base +
  scale_x_continuous(expand = expansion(add = c(1, 2))) +
  scale_y_continuous(expand = expansion(mult = c(0.15, 0.25)))
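To see what expansion() passes to the scale, we can inspect its return value. The sketch below also works out the expanded limits by hand for an illustrative axis that runs from 0 to 10. The names lims, e and expanded are only used for this example.

```r
library(ggplot2)

# expansion() returns c(mult_lower, add_lower, mult_upper, add_upper)
expansion(add = c(1, 2))
# 0 1 0 2
expansion(mult = c(0.15, 0.25))
# 0.15 0.00 0.25 0.00

# Worked example for illustrative limits of 0 to 10 (range = 10)
lims <- c(0, 10)
e <- expansion(mult = c(0.15, 0.25))
expanded <- c(
  lims[1] - e[1] * diff(lims) - e[2],  # lower: 0 - 0.15 * 10 - 0
  lims[2] + e[3] * diff(lims) + e[4]   # upper: 10 + 0.25 * 10 + 0
)
expanded
# -1.5 12.5
```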

The arguments limits = NULL, oob = censor and na.value = NA_real_ deal with the values and the range of values shown on the axis. Using limits you can set the range of values shown on an axis. By default (NULL), ggplot() shows all the values in the data. Adding a vector with limits restricts the range of the axis to the limits in the vector. You can also specify a function that returns a lower and an upper value for the range. The question is what happens to the values outside of the range. This is what oob determines. By default, censor sets all the out of bounds (oob) values to NA. Using {scales}, you can change that to scales::oob_squish, which replaces out of bounds values with their nearest limit, or scales::oob_keep, which keeps the values as they are.

The way in which you handle out of bounds values has important implications for the geometries in the plot. With the default, censor, R sets these values to NA, and it does so before the plot is drawn. No geometry will be able to access these values. For instance, geom_smooth() will only be able to smooth data that is not missing, i.e. data that is within the limits of the range. Using oob_keep avoids this. There is another alternative: you can set the limits in the coordinate component of the grammar. Doing so keeps the out of bounds values as they are. Setting limits in the coordinate component zooms in on the data.

To see the difference, let’s limit the range of the carat variable to 0.5 - 2.5. We’ll first do so within scale_x_continuous() using limits = c(0.5, 2.5) and the default treatment of out of bounds values, censor. The second plot uses the same limits with the oob_keep option from {scales}. Third, we’ll set the range in the coordinate system, using xlim = c(0.5, 2.5). The final plot uses all values in the dataset.

pl_scale_1 <- pl_scale_base + scale_x_continuous(limits = c(0.5, 2.5), oob = scales::oob_censor) + geom_smooth() + labs(title = "oob_censor", x = "")
pl_scale_2 <- pl_scale_base + scale_x_continuous(limits = c(0.5, 2.5), oob = scales::oob_keep) + geom_smooth() + labs(title = "oob_keep", x = "")
pl_scale_3 <- pl_scale_base + coord_cartesian(xlim = c(0.5, 2.5)) + geom_smooth() + labs(title = "xlim")
pl_scale_4 <- pl_scale_base + geom_smooth() + labs(title = "full dataset")

pl_scale_1 + pl_scale_2 + pl_scale_3 + pl_scale_4
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 3601 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 3601 rows containing missing values or values outside the scale range
(`geom_point()`).
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
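The out of bounds functions can also be called directly on a vector, which makes it easy to see what each of them does to the data before a geometry gets to use it. This is a small sketch with three illustrative carat values and the limits used in the plots above.

```r
x <- c(0.2, 1, 3)    # illustrative values: below, inside and above the limits
lims <- c(0.5, 2.5)  # limits used in the plots above

censored <- scales::oob_censor(x, range = lims)
censored
# NA  1 NA

squished <- scales::oob_squish(x, range = lims)
squished
# 0.5 1.0 2.5

kept <- scales::oob_keep(x, range = lims)
kept
# 0.2 1.0 3.0
```

With oob_censor, geom_smooth() never sees the values 0.2 and 3. With oob_keep, it uses them even though they fall outside the axis.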

Here you can see the effect of oob = oob_censor. Relative to the options where R keeps all out of bounds values (oob_keep and xlim in coord_cartesian()), oob_censor doesn’t use all data to estimate the smoothed line. The other two show a clear downward part at the end. You can see that this downward part is part of the “true” smoothed line in the plot that uses the full dataset. In other words, using the limits argument in scale_x/y_continuous() to limit the range of values on the x- or y-axis (or both) comes with a warning: you can do so, but only if the geometries do not use the data to calculate e.g. a smoothed line, a density, … . Otherwise, these geometries will use a restricted dataset. If you do set limits on the scale, be explicit on the treatment of out of bounds values and always add either scales::oob_censor or scales::oob_keep. Or, set the ranges using the xlim or ylim arguments of coord_cartesian(). {ggplot2} also includes the helper functions xlim() and ylim(), which you add to the plot outside of the coordinate system. For a plot with only points the result looks the same, but the two are not equivalent: xlim() and ylim() are shortcuts for the limits argument of the scale and censor the out of bounds values, as the warning below shows:

pl_scale_5 <- pl_scale_base + coord_cartesian(xlim = c(0.5, 2.5)) + labs(title = "coord_cartesian(xlim = c(0.5, 2.5))")
pl_scale_6 <- pl_scale_base + xlim(0.5, 2.5) + labs(title = "xlim(0.5, 2.5)")

pl_scale_5 + pl_scale_6
Warning: Removed 3601 rows containing missing values or values outside the scale range
(`geom_point()`).
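The warning is produced by the xlim() version only. A short check, as a sketch, shows why: xlim() creates a position scale with limits, so the default censoring of out of bounds values applies, while coord_cartesian() creates a coordinate system and leaves the data untouched. The object names are illustrative.

```r
library(ggplot2)

by_scale <- xlim(0.5, 2.5)                       # a position scale with limits
by_coord <- coord_cartesian(xlim = c(0.5, 2.5))  # a coordinate system

inherits(by_scale, "Scale")  # TRUE
inherits(by_coord, "Coord")  # TRUE

# xlim(0.5, 2.5) is shorthand for setting limits on the scale (oob = censor by default)
same_as_xlim <- scale_x_continuous(limits = c(0.5, 2.5))
same_as_xlim$limits
# 0.5 2.5
```

For a plot that only contains points the two look alike, but as soon as a geometry computes something from the data (a smoothed line, a density, …), only coord_cartesian() uses the full dataset.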

You can also extend the range. This might be useful if you want to include 0 as a value or if you want to explicitly show values larger than the maximum. It can also be useful if you compare graphs. For instance, let’s first create a plot that uses only prices < 10000 and carat < 3 and compare that plot with the base plot:

pl_scales_7 <- diamonds |> filter(price < 10000 & carat < 3) |> ggplot(aes(x = carat, y = price)) + geom_point(color = "lightsteelblue") + theme_minimal()

pl_scales_7 + pl_scale_base

Notice how the y-axes of the two plots differ. As a result, comparing them is difficult. To deal with this, it is useful to expand the limits on the first plot and set them equal to those on the second. To do so, we will set the limits of both plots to 0 - 20000:

pl_scales_7 <- pl_scales_7 + scale_y_continuous(limits = c(0, 20000)) + labs(title = "Price < 10000, carat < 3")
pl_scale_base2 <- pl_scale_base + scale_y_continuous(limits = c(0, 20000)) + labs(title = "carat < 3")

pl_scales_7 + pl_scale_base2

As you can see, it is now easy to compare both plots as the y-axes are identical.

The argument na.value = allows you to add a value that will replace missing values.

11.1.1.1.4 Transformations

The argument transform = "identity" shows, by default, the untransformed values on the axis. {ggplot2} and {scales} include a wide variety of ways to transform the variables mapped on the x- and y-axis, e.g. log10, exp, log, logit, sqrt, reverse, reciprocal, boxcox, probit, … . In addition, {scales} allows you to create your own transformation. Let’s illustrate a transformation using the log10 transformation. Using transform = "log10", R will compute the log with base 10 of the variable mapped on the axis and show the plot using the log-transformed variable. However, the labels will show the untransformed values. To see how this works, let’s return to the life expectancy dataset.

life_df |> filter(date == 2000) |>
  ggplot(aes(x = gdp_capita, y = life_exp, color = region, size = pop)) +
  geom_point() +
  theme_minimal()
Warning: Removed 23 rows containing missing values or values outside the scale range
(`geom_point()`).

Here, the variable gdp_capita shows a very wide range: the richest economy is more than 100 times richer than the poorest one. A log10 transformation can reduce this range. To illustrate, log10(100,000) = 5 and log10(10) = 1. In other words, a log transformation reduces the ratio between these two values from 1 in 10,000 to 1 in 5. One way to do so would be to calculate the log of the variable in the dataset and use it in the plot:

life_df |> filter(date == 2000) |> mutate(loggdp = log(gdp_capita, base = 10)) |>
  ggplot(aes(x = loggdp, y = life_exp, color = region, size = pop)) +
  geom_point() +
  theme_minimal()
Warning: Removed 23 rows containing missing values or values outside the scale range
(`geom_point()`).

However, as you can see in the graph, the labels on the horizontal axis are now log values such as 3, 4, 5, … . These are not informative in terms of the level of per capita GDP shown. You first have to raise 10 to these powers to see what level of per capita GDP they represent. You could deal with that using breaks and labels in scale_x_continuous():

life_df |> filter(date == 2000) |> mutate(loggdp = log(gdp_capita, base = 10)) |>
  ggplot(aes(x = loggdp, y = life_exp, color = region, size = pop)) +
  geom_point() +
  scale_x_continuous(breaks = c(3, 4, 5), labels = 10^(3:5)) +
  theme_minimal()
Warning: Removed 23 rows containing missing values or values outside the scale range
(`geom_point()`).

Using format(10^(3:5), scientific = FALSE) we can even avoid the scientific notation. However, as you can see, this takes a lot of code. This is where the transformation enters. Adding transform = "log10" to your code is sufficient to show the plot as if you had plotted it after a mutate(logvar = log(var, base = 10)) command:

life_df |> filter(date == 2000) |>
  ggplot(aes(x = gdp_capita, y = life_exp, color = region, size = pop)) +
  geom_point() +
  scale_x_continuous(transform = "log10") +
  theme_minimal()
Warning: Removed 23 rows containing missing values or values outside the scale range
(`geom_point()`).
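What the transformation does can be sketched by calling the transformation object from {scales} directly. This assumes a recent version of {scales} (1.3.0 or later), in which the function is called transform_log10(). The name log10_tr is illustrative.

```r
log10_tr <- scales::transform_log10()

# Positions on the axis: the log10 of the data values
positions <- log10_tr$transform(c(1000, 10000, 100000))
positions
# 3 4 5

# The inverse recovers the original values, which are used for the labels
originals <- log10_tr$inverse(c(3, 4, 5))
originals
# 1e+03 1e+04 1e+05
```

This is why the points are placed as in the plot based on loggdp, while the labels on the axis still show per capita GDP in its original units.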

You can now set the breaks and labels, and the plot is finished. Using transform = "reverse" you reverse the order of the axis. In other words, moving from left to right, the values on the x-axis run from large to small and, moving from bottom to top, the values on the y-axis run from high to low. Reversing the order on the y-axis shows lower levels of life expectancy at birth higher up the axis:

life_df |> filter(date == 2000) |>
  ggplot(aes(x = gdp_capita, y = life_exp, color = region, size = pop)) +
  geom_point() +
  scale_x_continuous(transform = "log10") +
  scale_y_continuous(transform = "reverse") +
  theme_minimal()
Warning: Removed 23 rows containing missing values or values outside the scale range
(`geom_point()`).

This transformation can be useful if you have two plots side by side. The first measures something where high is good (life expectancy at birth), the second something where high is bad (infant mortality rates). If you reverse the axis for the latter, both plots show good outcomes at the top and bad outcomes at the bottom. Doing so often makes it easier to compare charts.

For some of these transformations, you can use specific scales: scale_x/y_log10(), scale_x/y_reverse() and scale_x/y_sqrt(). For instance, you can use scale_x/y_log10() as a substitute for scale_x/y_continuous(transform = "log10").
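A quick way to convince yourself that the shortcut and the long form do the same thing is to compare the data that {ggplot2} actually draws. The sketch below uses a small illustrative data frame and layer_data(), which returns the data of a layer after the scale transformations have been applied.

```r
library(ggplot2)

df <- data.frame(x = c(10, 100, 1000), y = 1:3)  # illustrative data
base <- ggplot(df, aes(x, y)) + geom_point()

p_short <- base + scale_x_log10()
p_long  <- base + scale_x_continuous(transform = "log10")

# In both plots, x is stored on the log10 scale
x_short <- layer_data(p_short)$x
x_long  <- layer_data(p_long)$x
x_short
x_long
```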

11.1.1.1.5 Guide, position and secondary axis

The guide = argument allows you to specify the guide. For continuous variables mapped on the x- and y-axis this is less relevant, as you usually wouldn’t include a legend for these axes. Usually, the title and labels on the axis are sufficient as a guide. The position = argument determines where the axis will be shown: left or right for the vertical axis, and bottom or top for the horizontal axis. For the x-axis, the default is “bottom” and for the y-axis, the default is “left”. Setting these to “top” and/or “right” moves the axes:

pl_scale_base + 
  scale_x_continuous(position = "top") +
  scale_y_continuous(position = "right")

You can add a secondary axis to your plot, so that the plot has two axes for the same dimension. dup_axis() copies the primary x- or y-axis. sec_axis() allows for a monotonic transformation from the values on the primary axis to the values on the secondary axis. Using dup_axis() you can add a name; by default, the breaks and labels are those of the primary axis. For sec_axis() you also include a monotonic transformation (i.e. a one-to-one transformation of the values on the primary axis). To illustrate the first, let’s add a duplicate secondary vertical axis to the diamonds plot:

pl_scale_base + scale_y_continuous(sec.axis = dup_axis())

To remove the title from the secondary axis, you can include name = NULL. Using breaks = or labels =, you can also change the breaks and labels on the secondary vertical axis. To set these, you can use the same methods as for the primary axis:

pl_scale_base + 
  scale_y_continuous(
    sec.axis = dup_axis(
      name = NULL,
      breaks = c(5000, 15000), 
      labels = seq(5000, 15000, by = 10000)))

Using dup_axis() you can highlight other value ranges on the secondary axis than those on the primary axis. In the plot, observations on the left are relatively low on price, while those on the right are relatively high. You can stress this in the plot by setting different breaks for both axes: on the primary axis, you include more breaks and labels at the lower end of the price range, while on the secondary axis you add more breaks and labels at the higher end of the price range:

pl_scale_base + 
  scale_y_continuous(
    # each break appears once: the second sequence starts after 10000
    breaks = c(seq(0, 10000, by = 2500), seq(15000, 20000, by = 5000)),
    labels = paste0("$", c(seq(0, 10000, by = 2500), seq(15000, 20000, by = 5000))),
    sec.axis = dup_axis(
      name = NULL,
      breaks = c(seq(0, 10000, by = 5000), seq(12500, 20000, by = 2500)), 
      labels = paste0("$", c(seq(0, 10000, by = 5000), seq(12500, 20000, by = 2500)))))

Using sec_axis() you can add a secondary axis with a different scale. However, the scale of the secondary axis must be a monotonic transformation of the first. This allows you to show three variables on three axes: one variable mapped on the x-axis, one on the primary y-axis and one on the secondary y-axis. Although this is often done, it is not recommended. A “double” line graph, for instance, with one variable mapped on the primary vertical axis and another on the secondary vertical axis, suggests correlation, even if the correlation is completely spurious. To illustrate, consider Figure 11.2, which was taken from [Tyler Vigen](https://tylervigen.com/spurious-correlations) and shows the distance between the planet Neptune and the earth and the number of burglaries in Kansas. The plot suggests that we should find a way to get Neptune as close as possible to the earth. Tyler Vigen’s site includes many more of these examples and often includes a GenAI-generated explanation of why these correlations might exist.

Figure 11.2: Does Neptune cause burglaries in Kansas?

To show correlation, you can use other geometries, e.g. point geometries (where you map one variable on the horizontal axis and another on the vertical one, Figure 10.1), a path geometry to show co-movement, … .

In some cases, a secondary axis is necessary. The first case is where you want to show the same variable measured in two units (e.g. carat and mg, USD and euro, kilometers and miles, …), or where you want to show a date/time variable in two different time zones. Second, sometimes you want to show two unrelated variables in one plot. Although you could show two plots here, one per variable, two vertical axes are often used instead. In that case, you need a monotonic transformation from the first variable to the second. I refer to the discussion on [stackoverflow](https://stackoverflow.com/questions/3099219/ggplot-with-2-y-axes-on-each-side-and-different-scales) for ways to construct this transformation. To illustrate the first case, let’s add a secondary x-axis and a secondary y-axis to the diamonds plot. The first will show the weight of a diamond in mg at the top, the second the price in euro on the right:

pl_scale_base +
  scale_x_continuous(
    sec.axis = sec_axis(
      transform = ~ . * 200,
      name = "weight in mg")) +
  scale_y_continuous(
    sec.axis = sec_axis(
      transform = ~ . / 1.10,
      name = "price in euro", 
      # the transformation already converts USD to euro,
      # so the labels need no additional scale factor
      labels = scales::label_currency(prefix = "€")))

11.1.1.2 scale_x/y_date(), scale_x/y_datetime() and scale_x/y_time()

Recall from Chapter 3 that date and time variables are continuous numeric variables that store the number of days since January 1st 1970 or the number of seconds since midnight that day. Date/time variables are often used in data science: each transaction you do in a store has a date/time stamp, sales data is available per day or per month and stock market data is usually available at intervals measured in (parts of) seconds. There are three date/time scale functions: _date() for class Date, _datetime() for POSIXct and _time() for data measured in hours, minutes and seconds. Here, I’ll use scale_x/y_datetime() and focus on the differences with the previous scale. The arguments of this function are

scale_x_datetime(
  name = waiver(),
  breaks = waiver(),
  date_breaks = waiver(),
  labels = waiver(),
  date_labels = waiver(),
  minor_breaks = waiver(),
  date_minor_breaks = waiver(),
  timezone = NULL,
  limits = NULL,
  expand = waiver(),
  oob = censor,
  guide = waiver(),
  position = "bottom",
  sec.axis = waiver()
)

There are a couple of arguments that are specific for POSIXct variables: date_breaks, date_labels, date_minor_breaks and timezone. The other arguments are similar to those for the scale_x/y_continuous() scale. Using date_breaks, you can specify an interval using “sec”, “min”, “hour”, “day”, “week”, “month” or “year” (optionally followed by an s, e.g. “hours”, “days”). The notation follows “n hours”, “n days”, … . For instance, using date_breaks = "1 month" will return major break points that are one month apart. The breaks argument is similar to the equivalent argument in scale_x/y_continuous(): you can define breaks using a vector or a function. Here, given that the data are POSIXct, the vector should be a date/time vector and the function should return a date/time variable, e.g. seq.Date() or seq.POSIXt(). In addition, you can use {scales} to add specific date/time break functions:

  • scales::breaks_width(width, offset = 0) with width the width between major breaks using “n hours” and offset if you don’t want breaks to start at e.g. January 1st but at e.g. seven days later (offset = "7 days")
  • scales::breaks_pretty(n = 5) to set n breaks for date/time variables and works in a similar way to scales::extended_breaks(n = ) for continuous variables
  • scales::breaks_timespan(unit = c("secs", "mins", "hours", "days", "weeks"), n = 5) to set breaks using time intervals for timespan data, where unit is used to interpret the unit of the timespan (e.g. hours) and n is the desired number of breaks.

Note that date_breaks = "1 month" is equal to using breaks = scales::breaks_width("1 month") (note that the first argument is date_breaks and the second is breaks). You can define minor breaks in a similar way, although the minor breaks need to fit within the major breaks (e.g. weeks fit in a month).
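To see what such a break function does, you can call it outside of a plot. The sketch below uses a hypothetical date range (rng and monthly_breaks are names chosen for this illustration) and applies the function returned by scales::breaks_width() to it.

```r
library(scales)

# Hypothetical date range, used only to show which breaks are generated
rng <- as.Date(c("2023-01-15", "2023-04-20"))

# breaks_width() returns a function; applied to a range it returns the breaks
monthly_breaks <- breaks_width("1 month")(rng)
monthly_breaks

# every break falls on the first day of a month
format(monthly_breaks, "%d")
```

Inside a scale, {ggplot2} calls this function for you with the limits of the data, which is why breaks = scales::breaks_width("1 month") and date_breaks = "1 month" lead to the same breaks.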

To set the labels, you have two options: date_labels or labels. With date_labels, you can use the notation referred to in Table 3.1 and Table 3.2. In other words, date_labels = "%Y" will show 4-digit years as labels, date_labels = "%b" abbreviated months, … In addition, using “\n”, the escape sequence for a new line, you can show e.g. months above years. For instance, using date_labels = "%B\n%Y" will show the labels on two rows: the full name of the month on the first row and the 4-digit year on the second. Using labels you can add a character vector with labels or a function that generates these labels (paste, paste0, … ). In addition you can use {scales} functions:

  • scales::label_date(format = "%Y-%m-%d", tz = "UTC", locale = NULL) to set the format, time zone and locale (by default the current locale; use e.g. locale = "en" to show months and days in English, which requires the {stringi} package)
  • scales::label_date_short(format = c("%Y", "%b", "%d", "%H:%M"), sep = "\n") to generate labels on two rows, with e.g. the day or the month on the first row and the month or the year on the second, where the second row is only added when that larger unit changes
  • scales::label_time(format = "%H:%M:%S", tz = "UTC", locale = NULL) to set the format for time, including time zone and locale
  • scales::label_timespan(unit = c("secs", "mins", "hours", "days", "weeks"), space = FALSE, ...) to show a timespan in seconds, minutes, hours, … . With space = TRUE you add a space before the time unit; ... allows you to specify e.g. the accuracy.
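As with the break functions, the label functions from {scales} return a function that you can try out on a vector of dates. The dates below are hypothetical and only serve as input for this sketch.

```r
library(scales)

# Hypothetical breaks: three consecutive months around a change of year
dates <- as.Date(c("2022-12-01", "2023-01-01", "2023-02-01"))

# label_date() formats every break in the same way
label_date(format = "%Y-%m")(dates)

# label_date_short() repeats the larger unit (here the year) only when it changes
label_date_short()(dates)

# base R equivalent of date_labels = "%B\n%Y": month name and year on two rows
cat(format(dates[1], "%B\n%Y"))
```

The names of the months depend on the locale of your R session unless you set the locale argument.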

If you specify date_breaks, date_labels or date_minor_breaks as well as breaks or labels or minor_breaks the former will be used.

To illustrate, we’ll use the nycflights dataset which you can import from the data > raw directory as nycflights.csv. This dataset is the result of the mutations completed in the “your turn” in Chapter 8.

nyclfights <- read_csv(here::here("data", "raw", "nycflights.csv"))
Rows: 435352 Columns: 36
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (12): dep_day, sched_dep_day, dep_month, carrier, tailnum, origin, dest...
dbl  (16): year, month, day, dep_delay, arr_delay, gain, flight, air_time, t...
lgl   (2): dst_on_orig, dst_on_dest
dttm  (6): dep_hhmm, sched_dep_hhmm, arr_hhmm_utc, arr_hhmm_tzc, sched_arr_h...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We will show a plot with the number of flights per day. To do so, we only use the flights with all available data. Recall that the variable status records whether an observation has missing data. Using status == "complete" we filter the data. To limit the sample, we use slice_sample(prop = 0.10) and keep 10% of the observations. Using {lubridate}’s floor_date() (see Chapter 3), we round the scheduled departure time in POSIXct down to the day. Using summarize(), we can then calculate the total number of flights per day. Although we could pipe the resulting data frame into ggplot(), we will first assign it to nyc_day:

nyc_day <- nyclfights |> 
  filter(status == "complete") |>
  slice_sample(prop = 0.10) |> 
  mutate(round_date = floor_date(sched_dep_hhmm, unit = "day")) |>
  group_by(round_date) |>
  summarize(n_flights = n()) |>
  ungroup()
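If you do not have the nycflights file at hand, the rounding and counting steps can be followed on a toy data frame. The data below are hypothetical; only the column name sched_dep_hhmm is borrowed from the dataset.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data: three scheduled departures on two different days
toy_flights <- tibble(
  sched_dep_hhmm = as.POSIXct(
    c("2023-01-01 06:10:00", "2023-01-01 18:45:00", "2023-01-02 09:30:00"),
    tz = "UTC"))

toy_day <- toy_flights |>
  # round every date/time down to midnight of the same day
  mutate(round_date = floor_date(sched_dep_hhmm, unit = "day")) |>
  group_by(round_date) |>
  # count the number of rows (flights) per day
  summarize(n_flights = n()) |>
  ungroup()

toy_day
```

The result has one row per day: two flights on January 1st and one on January 2nd.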

Using a line geometry with the rounded date mapped on the x-axis and the number of flights on the y-axis and accepting all default values, ggplot() shows this graph:

nyc_day |>
  ggplot(aes(x = round_date, y = n_flights)) +
  geom_line() +
  theme_minimal()

The x-axis shows 5 dates: “Jan 2023”, “Apr 2023”, “Jul 2023”, “Oct 2023” and “Jan 2024”. Let’s remove the title of the axis and show labels for every month. To do so, we use breaks = scales::breaks_width("1 month") to set the breaks and labels = scales::label_date_short() to set the labels. If you don’t need to change the language of the days, months, … you can leave the function at its defaults. Here, we add locale = "en" to show the dates in English (setting the locale requires the {stringi} package). Change this into, e.g. "es", and you’ll get Spanish months.

nyc_day |>
  ggplot(aes(x = round_date, y = n_flights)) +
  geom_line() +
  scale_x_datetime(
    name = NULL,
    breaks = scales::breaks_width("1 month"), 
    labels = scales::label_date_short(
      format = c("%Y", "%b", "%d", "%H:%M"), 
      sep = "\n", 
      locale = "en"), 
    limits = c(
      as.POSIXct("2023-01-01 00:00:00", tz = "UTC"), 
      as.POSIXct("2023-12-31 23:59:59", tz = "UTC"))) +
  theme_minimal()

As you can see, every major break now shows the month. The first and last month also include the year. Doing so, it is easy to read from the axis when date/time variables change from one year to the other, from one month to the other, … . Let’s change the breaks using date_breaks and set labels using date_labels. We’ll set major breaks per 3 months, add minor breaks per month and show labels using the full name of the month on the first row and the 2-digit year on the second. In addition, we will add a title to the y-axis and remove minor breaks from that axis:

nyc_day |>
  ggplot(aes(x = round_date, y = n_flights)) +
  geom_line() +
  scale_x_datetime(
    name = NULL,
    date_breaks = "3 months", 
    date_labels = "%B\n%y",
    date_minor_breaks = "1 month") +
  scale_y_continuous(
    name = "Number of flights", 
    minor_breaks = NULL) +
  theme_minimal()

Let’s now focus on 1 day and show the number of flights on July 4, 2023 per hour. To select this day, we use month() and mday() from {lubridate}:

nyc_july4 <- nyclfights |> 
  filter(status == "complete" & month(sched_dep_hhmm) == 7 & mday(sched_dep_hhmm) == 4) |>
  mutate(round_date = floor_date(sched_dep_hhmm, unit = "hour")) |>
  group_by(round_date) |>
  summarize(n_flights = n()) |>
  ungroup()

Let’s see what ggplot() returns by default:

nyc_july4 |>
  ggplot(aes(x = round_date, y = n_flights)) +
  geom_line() +
  theme_minimal()

R shows 3 times, including the month and day. To change this, let’s add a major break every hour and show the time as e.g. “5 AM”, “3 PM”. To do so, we use date_breaks = "1 hour" and date_labels = "%l %p" where %l (small L) refers to the hour on a 12-hour clock without a leading zero, running from 1 to 12 (%I, capital i, would run from 01 to 12) and %p refers to AM/PM. We add a space between the hour and AM/PM.

nyc_july4 |>
  ggplot(aes(x = round_date, y = n_flights)) +
  geom_line() +
  scale_x_datetime(
    name = NULL,
    date_breaks = "1 hour",
    date_labels = "%l %p") +
  theme_minimal()

You can add limits using limits =. Recall that they have to be of type POSIXct. For instance, starting the graph an hour earlier and ending it one hour later:

nyc_july4 |>
  ggplot(aes(x = round_date, y = n_flights)) +
  geom_line() +
  scale_x_datetime(
    name = NULL,
    date_breaks = "1 hour",
    date_labels = "%l %p", 
    limits = 
      c(as.POSIXct("2023-07-04 04:00:00", tz = "UTC"), 
        as.POSIXct("2023-07-04 23:00:00", tz = "UTC"))) +
  theme_minimal()

11.1.1.3 Color, fill, size and transparency

The color, fill and size aesthetic can be used to show continuous variables. In Figure 9.1, size was used to show the size of the population and color was used to show the region. In that graph, population is a continuous variable, region a discrete variable. Using the aesthetic size to map the population, ggplot() scales the area of the geometry - here a point geometry - to the size of the populations. For continuous variables mapped on the color or fill aesthetic, ggplot() will use a continuous color scale to show the values of that variable. For each of these aesthetics, {ggplot2} adds a legend.

With respect to the color or fill aesthetics, selecting the appropriate colors is not easy. Here, we will not cover “color theory” but refer to e.g. Lisa Charlotte Muth’s [A detailed guide to colors in data vis style guides] (https://www.datawrapper.de/blog/colors-for-data-vis-style-guides) or Datawrapper’s section on [color in data viz] (https://www.datawrapper.de/blog/category/color-in-data-vis). First, not all colors are interpreted in the same way across cultures. Second, people with color blindness will see “colors” differently. In other words, the choice of one or multiple colors to show your data is a lot more difficult than choosing the colors that you happen to like.

In general, there are two types of continuous color scales: sequential and diverging. The first uses one hue and changes the “shade” of the color from, e.g. light grey to dark grey or from light red to dark red. Diverging scales use two or more distinct hues, e.g. green and red with a dark or light midpoint. To show continuous variables with e.g. bar or column geometries, there are “binned” versions of continuous color scales. These binned versions assign a discrete color scale to continuous variables.

By default, {ggplot2} uses scale_color/fill_continuous() for continuous color scales. Because the color and fill aesthetics use similar functions, we refer to both at the same time using color/fill as a shortcut to scale_color_continuous() or scale_fill_continuous(). These functions include the following arguments:

scale_colour_continuous(..., type = )

# use ... to set e.g.: 

  name = waiver()
  breaks = waiver()
  minor_breaks = waiver()
  n.breaks = NULL
  labels = waiver()
  limits = NULL
  rescaler = rescale
  oob = censor
  expand = waiver()
  na.value = NA_real_
  transform = "identity"
  guide = "legend"
  position = "left"

Using ... you can set e.g. breaks, limits and labels. The type argument refers to the type of the color scale. By default, this value is gradient. Other options include viridis or a function that returns a continuous color scale. The default gradient implies that {ggplot2} actually defaults to the scale_color/fill_gradient(). In other words using scale_color/fill_continuous() (accepting all defaults) is equivalent to scale_color_gradient() (accepting all defaults). However, using scale_color_gradient() there are more options to set the color scale. This function uses two hues to develop a continuous scale. The arguments of this function are

scale_colour_gradient(
  name = waiver(),
  ...,
  low = "#132B43",
  high = "#56B1F7",
  space = "Lab",
  na.value = "grey50",
  guide = "colourbar",
  aesthetics = "colour"
)

where ... allows you to set breaks, labels, minor breaks, …; low is the color used to show low values and is by default “#132B43” and high is the color used to show high values with default “#56B1F7”. Missing values are shown using “grey50”. If you replace the low and high values with your own colors, scale_color_gradient() will develop a color scale using those two colors for the low and high values. You can include the colors using their name. R has 657 color names. Running colors() in the console lists all of them. In addition, you can add a color HEX code. The guide argument refers to the way the color scale is shown in the legend. Using guide_colorbar() you can modify the way the legend looks and determine e.g. its position. However, these layout items can also be set in the non-data parts of the plot. Here, the color scale is shown using a color bar. There are two variations on the gradient scale: scale_color_gradient2() and scale_color_gradientn(). The first

scale_colour_gradient2(
  name = waiver(),
  ...,
  low = muted("red"),
  mid = "white",
  high = muted("blue"),
  midpoint = 0,
  space = "Lab",
  na.value = "grey50",
  transform = "identity",
  guide = "colourbar",
  aesthetics = "colour"
)

includes two additional arguments: mid and midpoint. Here R will use a scale starting at the color for low and moving to the color for high but will use the color mid to show values around midpoint. By default, the color for low is a muted red, the color for high is a muted blue and the color for the midpoint (with default value 0) is “white”. If 0 is not a meaningful center for your variable, you need to set midpoint yourself, e.g. to the mean or median of the data. In other words, by default, the color scale will start at red for low values, change to white to show values around the midpoint and add shades of blue as the values increase. The second variant, scale_color_gradientn()

scale_colour_gradientn(
  name = waiver(),
  ...,
  colours,
  values = NULL,
  space = "Lab",
  na.value = "grey50",
  guide = "colourbar",
  aesthetics = "colour"
)

allows you to add many colors in a vector for the colours (or colors) argument.
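The three gradient functions can be compared on a small plot. The sketch below uses the built-in mtcars data and arbitrary colors; both are assumptions for illustration, as are the object names.

```r
library(ggplot2)

pl_hp <- ggplot(mtcars, aes(x = wt, y = mpg, color = hp)) +
  geom_point(size = 3)

# sequential: two colors, from low to high
pl_seq <- pl_hp +
  scale_color_gradient(low = "lightblue", high = "darkblue")

# diverging: the midpoint is taken from the data because hp is never 0
pl_div <- pl_hp +
  scale_color_gradient2(
    low = "red", mid = "white", high = "blue",
    midpoint = median(mtcars$hp))

# n colors: a vector of colors, interpolated in the order given
pl_n <- pl_hp +
  scale_color_gradientn(colours = c("yellow", "orange", "red", "darkred"))

pl_div
```

Printing pl_seq or pl_n instead of pl_div shows the other two variants.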

Let’s return to scale_color/fill_continuous(). The default value for type is gradient but you can also use viridis. [Viridis] (https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html#usage) scales also remain readable in black and white (e.g. if printed). If the option viridis is chosen, R actually uses scale_color_viridis_c() to show continuous viridis scales

scale_colour_viridis_c(
  name = waiver(),
  ...,
  alpha = 1,
  begin = 0,
  end = 1,
  direction = 1,
  option = "D",
  values = NULL,
  space = "Lab",
  na.value = "grey50",
  guide = "colourbar",
  aesthetics = "colour"
)

In the scale name, the _c stands for “continuous”. The option argument allows you to choose one of the viridis color maps. You can refer to those by their name or by a letter. By default the color map is “D” or “viridis”. The alpha argument allows you to specify the transparency of the scale. The arguments begin and end set the position in the color map, between 0 and 1, of the color for the lowest and the highest values. Using a higher value for begin will exclude the first proportion of colors from the color map. Lowering the value for end excludes the last proportion of colors from the color map. Setting direction = -1 reverses the scale (i.e. the color at end shows low values and the color at begin shows high values). The full range of color maps, A to H, as well as an illustration of the alpha, begin and end values is shown in Figure 11.3. For these last options, the figure always uses option “J” as a reference.

Figure 11.3: Viridis color scales
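You can also preview these options in the console. The sketch below assumes a version of {scales} that includes pal_viridis() (the same naming as pal_hue()); the values chosen for begin and end are arbitrary.

```r
library(scales)

# five colors from the default map (option "D", "viridis")
pal_viridis()(5)

# reversed direction: the scale now starts at the yellow end
pal_viridis(direction = -1)(5)

# drop the darkest 20% and the brightest 10% of the map
pal_viridis(begin = 0.2, end = 0.9)(5)

# another color map, referred to by name
show_col(pal_viridis(option = "magma")(5))
```

The same arguments can be passed to scale_color/fill_viridis_c().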

scale_color/fill_distiller() uses the [ColorBrewer] (https://colorbrewer2.org/#) scales to create continuous scales. The ColorBrewer scales were designed to create attractive maps but are widely used in other applications. The ColorBrewer scales are shown in Figure 11.4:

Figure 11.4: ColorBrewer scales

To use these scales as continuous scales, scale_color/fill_distiller() allows you to select the ColorBrewer palette in the palette argument using its number or name (Figure 11.4). The type argument allows you to specify if you need a sequential ("seq") or diverging ("div") scale.

scale_colour_distiller(
  name = waiver(),
  ...,
  type = "seq",
  palette = 1,
  direction = -1,
  values = NULL,
  space = "Lab",
  na.value = "grey50",
  guide = "colourbar",
  aesthetics = "colour"
)

For instance, to use the “Pastel2” scale, you can use scale_color_distiller(palette = "Pastel2").

There are many other packages that define color scales. {paletteer} collects all those scales. If you install this package, you can use all these scales with scale_color_paletteer_c() and include the name of the package and the name of the scale. In addition, you can add arguments for breaks, labels, … For instance, to use the {ggthemes} scale “Orange-Blue Diverging”, you use scale_colour_paletteer_c("ggthemes::Orange-Blue Diverging"). The very large number of available scales is shown at [The Paletteer Gallery] (https://pmassicotte.github.io/paletteer_gallery/#discrete-palettes) and includes scales based on Harry Potter, Vincent Van Gogh, Star Trek, … .

Note that selecting a good color scale is very important for any visualization. Colors often represent values, emotions, … that might differ from one culture to the other. In addition, color blind people will not see all color scales in the same way. To illustrate, Figure 11.5 shows how people with various types of color blindness see the rainbow:

Figure 11.5: How color blind people see the rainbow

A note on the name, breaks, limits and labels arguments that you can use in these color and fill functions. These arguments work in a similar way as they did for scale_x/y_continuous(). However, in the latter, they show up as name, breaks, limits and labels on the x- or y-axis. Here, they will show up as name, breaks, limits and labels in the legend. Using name you can add a name to the legend. Adding breaks = c(1000, 2000) for a color scale will show these two breaks in the legend, just as the breaks argument for the x- and y-axis sets the major breaks. Adding limits = c(0, 10000) will create a continuous scale starting at 0 and ending at 10000 even if the range of the data is different. With labels you can set labels as you did for continuous scales on the horizontal or vertical axis, e.g. using {scales} to show currencies or percentages.
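A minimal sketch of these four arguments on a color scale, using the built-in mtcars data (the data, the values and the suffix are assumptions for illustration):

```r
library(ggplot2)

pl_legend <- ggplot(mtcars, aes(x = wt, y = mpg, color = hp)) +
  geom_point(size = 3) +
  scale_color_viridis_c(
    name = "Horsepower",            # title of the legend
    breaks = c(100, 200, 300),      # values shown on the color bar
    limits = c(50, 350),            # range covered by the color scale
    labels = scales::label_number(suffix = " hp"))

pl_legend
```

The legend now runs from 50 to 350 and shows three labelled breaks, although the data themselves only cover part of that range.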

For the size and alpha aesthetics, you can use scale_size() or scale_alpha(). If you use scale_size/alpha_continuous(), R will use scale_size/alpha() and scale the area or the transparency. The arguments of this function are similar to the ones for the previous functions, with the exception of range:

scale_size(
  name = waiver(),
  breaks = waiver(),
  labels = waiver(),
  limits = NULL,
  range = c(1, 6),
  transform = "identity",
  trans = deprecated(),
  guide = "legend"
)

By default, range sets the size of the smallest value equal to 1 and the size of the largest value equal to 6. For the alpha aesthetic, this range is 0.1 to 1. You can change this by altering these numbers. The interpretation of name, breaks, labels and limits is identical: name adds a name to the legend, breaks determines the breaks to show on the legend, labels changes the labels on the legend and can be adapted using e.g. {scales} and limits restricts the values that will be shown.
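To illustrate range, the sketch below maps a continuous variable on size and widens the default range. The mtcars data and the chosen values are assumptions for illustration.

```r
library(ggplot2)

pl_size <- ggplot(mtcars, aes(x = wt, y = mpg, size = hp)) +
  geom_point(alpha = 0.5) +
  scale_size(
    name = "Horsepower",
    breaks = c(100, 200, 300),
    range = c(0.5, 8))

pl_size

# the sizes that are actually drawn run from 0.5 to 8
range(layer_data(pl_size)$size)
```

layer_data() returns the data as {ggplot2} draws them, which is a convenient way to check what a scale does with your values.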

Note that in all these scales, transform is set to identity. Changing this into e.g. log10 will transform the variables and apply the color, size or fill scale to these transformed variables. In addition, you can use line width to map a continuous variable. However, this is rarely done and this aesthetic is used primarily for discrete variables.

11.1.2 Discrete scales

Discrete scales map discrete variables on the x- and y-axis or on other aesthetics such as color, size, shape, alpha, line type or line width. These functions are very similar to their equivalents for continuous variables. The position scales scale_x/y_discrete() are used to show the position of the values of the variable mapped on the horizontal and/or vertical axis. The arguments of the scale_x_discrete() function are:

scale_x_discrete(
  name = waiver(),
  ...,
  expand = waiver(),
  guide = waiver(),
  position = "bottom"
)

# Use ... to set e.g. : 
  
  breaks = waiver()
  labels = waiver()
  limits = NULL
  expand = waiver()
  na.translate = TRUE
  na.value = NA
  drop = TRUE
  guide = "legend"
  position = "left"

Most of the arguments were covered for the continuous variants of the scales. The arguments that are new are specific for discrete scales. First, discrete variables allow you to show missing values. By default, na.translate = TRUE shows these missing values. Changing this into FALSE will remove these missing values. With na.value = NA, the missing values are shown on the right hand side of the x-axis or at the end of the y-axis. For other aesthetics, you can set e.g. the color, shape, size, line type or width or alpha that is used to show these missing values. The argument drop = TRUE is relevant for factor variables. If a factor level is not represented in the data, it will not be shown by default. Changing this into FALSE shows all factor levels. If they are shown but have no values, a bar or column graph will have zero height.

For the labels = argument, you can include a named vector. For instance, in the nycflights dataset, “LGA” is “LaGuardia”, “EWR” is “Newark Liberty International Airport” and “JFK” is “John F. Kennedy International Airport”. If you use labels = c("EWR" = "Newark Liberty Int.", "JFK" = "John F. Kennedy Int.", "LGA" = "LaGuardia") the axis or legend will show the full names, not the FAA abbreviation.
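The drop and labels arguments can be tried on a toy data frame. The data below are hypothetical: the factor has three levels, but LGA does not occur in the data.

```r
library(ggplot2)

# Hypothetical toy data: LGA is a factor level without observations
flights_toy <- data.frame(
  origin = factor(c("EWR", "JFK", "JFK"), levels = c("EWR", "JFK", "LGA")))

airport_labels <- c(
  "EWR" = "Newark Liberty Int.",
  "JFK" = "John F. Kennedy Int.",
  "LGA" = "LaGuardia")

pl_airports <- ggplot(flights_toy, aes(x = origin)) +
  geom_bar() +
  # keep the unused level and replace the abbreviations with full names
  scale_x_discrete(labels = airport_labels, drop = FALSE)

pl_airports
```

With drop = FALSE the axis keeps a position for LaGuardia, without a bar; with the default drop = TRUE that position disappears.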

scale_color/fill_discrete() by default uses scale_color/fill_hue(), which uses evenly spaced colors on the color wheel. For instance, if you map a variable with 6 values on the fill aesthetic in a bar chart, scale_fill_hue() will default to

show_col(pal_hue()(6))

You can change the luminance or lightness of the color

show_col(pal_hue(l = 20)(6))

or the location where this scale will start along the color wheel

show_col(pal_hue(h.start = 90)(6))

You can also set the chroma or intensity of the colors, using the c argument (default 100).

Usually, for discrete color/fill scales, the colors are taken from a palette, e.g. ColorBrewer, or set manually. For the former, you use scale_color/fill_brewer()

scale_colour_brewer(
  name = waiver(),
  ...,
  type = "seq",
  palette = 1,
  direction = 1,
  aesthetics = "colour")

and use a ColorBrewer palette shown in Figure 11.4. In addition you can use the {paletteer} package’s scale_color/fill_paletteer_d() and select one of the many discrete color scales in this package.

Note that scale_shape() maps a discrete variable to only 6 shapes. Here, there is one shortcut among the arguments: solid = TRUE shows filled shapes; setting this value to FALSE uses open shapes. For scale_linewidth() the key argument is range = c(1, 6), which sets the width of the line for the smallest value (1) and for the largest value (6).

Usually, the number of values of the discrete variables that are mapped on color, fill, size, shape, alpha, line width or type is small and you can set them manually. To do so, you can use the scale_<aes>_manual() functions (where <aes> refers to the aesthetic: color, fill, shape, … ). These functions all have the same arguments:

scale_colour_manual(
  ...,
  values,
  aesthetics = "colour",
  breaks = waiver(),
  na.value = "grey50"
)

# Use ... to set e.g. : 
  
  breaks = waiver()
  labels = waiver()
  limits = NULL
  expand = waiver()
  na.translate = TRUE
  na.value = NA
  drop = TRUE
  guide = "legend"
  position = "left"

The values argument allows you to set the values for the aesthetic: the color (color and fill), the shape, line width or type, size or alpha. To do so, you add these values in a vector. For instance, to set the values of a fill or color scale, you can add the names of colors or the HEX codes: values = c("lightsteelblue", "steelblue", "blue") or c("#7CAE00", "#00B4F0", "#FF64B0"), for shapes and lines you can refer to the number (Figure 10.3, Figure 10.6). The values will be matched with the breaks if specified or in the order of the factor.

In addition, you can also use a named vector, where you specify a color/shape/size, … for every value of the variable that is mapped on the aesthetic. For instance, nycflights includes 3 origin airports: EWR, JFK and LGA. Using color_airports <- c("EWR" = "#00BFC4", "JFK" = "#F564E3", "LGA" = "#00BA38") you can define a color for every airport. It is now sufficient to add values = color_airports and the values will be shown in the selected color. As you can reuse this named vector in other code, the same colors will show up consistently in all your graphs. Doing so, you will always use the same color for e.g. product names, brands, countries, regions, … . In addition, it also allows you to be consistent with an organisation’s or brand’s style guide. For instance, KU Leuven’s blues used in the logo are #52BDEC and #00407A; for print, the brand style includes a fixed set of other colors. Using these, KU Leuven can consistently use the same colors in all its print, but also in graphs and tables.
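A minimal sketch of such a named vector in a manual scale. The colors are the ones from the text; the data frame toy_delay and its values are hypothetical.

```r
library(ggplot2)

# named vector: one color per origin airport (values taken from the text)
color_airports <- c("EWR" = "#00BFC4", "JFK" = "#F564E3", "LGA" = "#00BA38")

# Hypothetical toy data, not the nycflights data
toy_delay <- data.frame(
  origin = c("EWR", "JFK", "LGA"),
  avg_delay = c(15, 12, 10))

pl_manual <- ggplot(toy_delay, aes(x = origin, y = avg_delay, color = origin)) +
  geom_point(size = 4) +
  # the names of the vector are matched with the values of origin
  scale_color_manual(values = color_airports)

pl_manual
```

Because the colors are matched by name, the order of the elements in color_airports does not matter.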

Binned scales are used when you want to show continuous variables in a plot that shows discrete variables. The variable price in the diamonds dataset for instance is a continuous variable. If you want to use this variable in a column or bar chart to show, e.g. the number of observations, you first have to divide it into bins. To illustrate the binned position scales, we’ll use scale_x_binned():

scale_x_binned(
  name = waiver(),
  n.breaks = 10,
  nice.breaks = TRUE,
  breaks = waiver(),
  labels = waiver(),
  limits = NULL,
  expand = waiver(),
  oob = squish,
  na.value = NA_real_,
  right = TRUE,
  show.limits = FALSE,
  transform = "identity",
  trans = deprecated(),
  guide = waiver(),
  position = "bottom"
)

The main arguments of the binned scales are n.breaks = and nice.breaks = TRUE. The first allows you to determine the number of bins. Recall that for a histogram geometry, this was also the case. The default here is 10. With the second argument, nice.breaks, R will by default try to put breaks at “nice” values instead of evenly spread within the limits. Changing this to FALSE will put the breaks at evenly spaced intervals. Two other arguments are right and show.limits. Bins can be open or closed on the right and left. An interval is closed on the right if the last value is included in the interval; it is closed on the left if the first value is included. Here, the first and last value refer to the values at break positions. For instance, suppose that the bins are 0-2 and 2-4. By default, the last values, 2 and 4, are part of the lower bin: the first bin includes 2 and the second bin includes 4. If a bin is open on the right, the last value is included in the next bin. In other words, 2 would be part of the second bin and 4 would be part of the third bin. The last argument, show.limits, includes the option to show the limits of the scale as ticks. By default this is not the case.
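The sketch below shows a binned position scale for price in the diamonds data. The two calls to base R’s cut() are only an analogy, added here as an assumption for illustration, to show what closed and open on the right mean for a value that lies exactly on a break.

```r
library(ggplot2)

# bar chart of a continuous variable: the scale divides price into bins
pl_binned <- ggplot(diamonds, aes(x = price)) +
  geom_bar() +
  scale_x_binned(n.breaks = 10)

pl_binned

# base R analogy: in which bin does the value 2 end up?
cut(2, breaks = c(0, 2, 4), right = TRUE)   # closed on the right: (0,2]
cut(2, breaks = c(0, 2, 4), right = FALSE)  # open on the right:   [2,4)
```

Adding right = FALSE to scale_x_binned() moves observations that lie exactly on a break to the next bin in the same way.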

For color/fill scales, the binned version scale_color/fill_binned() is comparable to the discrete version scale_color/fill_discrete(). As was the case with the latter, the former will default to another scale: scale_color/fill_steps(), scale_color/fill_steps2() or scale_color/fill_stepsn(), where these functions have the same interpretation as the scale_color/fill_gradient() variations, or scale_color/fill_viridis_b() if you set the type to “viridis”. You can also use any other scale, e.g. from the {paletteer} package using scale_color_paletteer_binned().

11.2 Guides (legends)

In almost all scale functions, there is an argument guide. With the exception of the x- and y-axis, a graph needs a guide that shows the relation between the values and the aesthetic. Doing so, the guide shows which variables are mapped on the aesthetic and what a color, shape or size represents in terms of values. The guide argument of the scale functions lets you change the default values for the guides. As an alternative, you can specify the guides as an independent layer using guides(aesthetic = guide_function()), e.g. guides(x = guide_axis()) or guides(color = guide_colorbar()).

For continuous and discrete x- and y- axis the guide_axis() includes the following arguments:

guide_axis(
  title = waiver(),
  theme = NULL,
  check.overlap = FALSE,
  angle = waiver(),
  n.dodge = 1,
  minor.ticks = FALSE,
  cap = "none",
  order = 0,
  position = waiver()
)

The main arguments are check.overlap, angle and n.dodge. Sometimes, the labels on the x- or y-axis are very wide and overlap. Consider for instance this example of a column graph with the average life expectancy per region.

life_df |> group_by(region) |> summarize(ave_life = mean(life_exp, na.rm = TRUE)) |>
  ggplot(aes(x = region, y = ave_life)) +
  geom_col()

Here you can see that the labels on the horizontal axis overlap. To deal with that, you can organize them on 2 or more rows. To do so, you can use the n.dodge = argument. By default, the labels are shown on one row; setting this value to 2 or 3 shows the labels on two or three rows. Using 2, there is no overlap left:

life_df |> group_by(region) |> summarize(ave_life = mean(life_exp, na.rm = TRUE)) |>
  ggplot(aes(x = region, y = ave_life)) +
  geom_col() +
  guides(x = guide_axis(n.dodge = 2))

Changing the value of check.overlap to TRUE isn’t always helpful. For instance, for this plot, this will cause R to drop the value for “Latin America & Caribbean”.

life_df |> group_by(region) |> summarize(ave_life = mean(life_exp, na.rm = TRUE)) |>
  ggplot(aes(x = region, y = ave_life)) +
  geom_col() +
  guides(x = guide_axis(check.overlap = TRUE))

As a last option, you can change the angle. For instance, setting an angle of 22.5 would solve the overlap. Note that an angle of 90 puts the labels in a vertical position.

life_df |> group_by(region) |> summarize(ave_life = mean(life_exp, na.rm = TRUE)) |>
  ggplot(aes(x = region, y = ave_life)) +
  geom_col() +
  guides(x = guide_axis(angle = 22.5))

The last argument, position allows you to set the labels at the bottom (default), top or for y-axis, left or right. For instance, setting the axis at the top:

life_df |> group_by(region) |> summarize(ave_life = mean(life_exp, na.rm = TRUE)) |>
  ggplot(aes(x = region, y = ave_life)) +
  geom_col() +
  guides(x = guide_axis(n.dodge = 2, position = "top"))

Recall that you can set the position of the scale in the scale functions. For instance, using scale_x_discrete(), you could set position = "top". However, if you then add a guide layer with position at the bottom, R will show the guide at the bottom:

life_df |> group_by(region) |> summarize(ave_life = mean(life_exp, na.rm = TRUE)) |>
  ggplot(aes(x = region, y = ave_life)) +
  geom_col() +
  scale_x_discrete(position = "top") +
  guides(x = guide_axis(n.dodge = 2, position = "bottom"))

The title argument allows you to set the title of the axis. Note that this is also something that you can do in scale_x_discrete or using labs(x = , y = ).

life_df |> group_by(region) |> summarize(ave_life = mean(life_exp, na.rm = TRUE)) |>
  ggplot(aes(x = region, y = ave_life)) +
  geom_col() +
  guides(x = guide_axis(title = "World Bank Regional Grouping",  n.dodge = 2))

The guide for continuous colors is guide_colorbar:

guide_colourbar(
  title = waiver(),
  theme = NULL,
  nbin = NULL,
  display = "raster",
  raster = deprecated(),
  alpha = NA,
  draw.ulim = TRUE,
  draw.llim = TRUE,
  position = NULL,
  direction = NULL,
  reverse = FALSE,
  order = 0,
  available_aes = c("colour", "color", "fill"),
  ...
)

The available_aes argument shows for which aesthetics this guide function can be used: color or fill. To illustrate, let’s use a simple plot using the 2000 data for life expectancy. Here the per capita GDP is mapped on the x-axis, life expectancy on the y-axis as well as on the color aesthetic, and the population on the size aesthetic. The x-scale is “log10” transformed.

life_df |> filter(date == 2000) |>
  ggplot(aes(x = gdp_capita, y = life_exp, color = life_exp, size = pop)) +
  geom_point() +
  scale_x_continuous(transform = "log10") +
  scale_color_viridis_c(option = "magma")
Warning: Removed 23 rows containing missing values or values outside the scale range
(`geom_point()`).

The main arguments include position, direction, reverse and order. The first determines the position of the guide: “top”, “bottom”, “left” or “right”. The second sets the direction. By default, the guide is shown vertically in the left or right position and horizontally in the bottom and top positions. You can change that default by setting the desired direction. The argument reverse allows you to change the order of the colorbar. By default, R shows the highest values at the top. In the example, the colors associated with higher life expectancy at birth are at the top, those with lower levels at the bottom. Setting reverse to TRUE changes that order. The order argument allows you to determine the order of the legends. Here, we have two: one for life expectancy and one for population size. If you would like to show population first and life expectancy second, you need to add a guide for size and set the order to 1 for the size guide and to 2 for the color guide.

Let’s illustrate a couple of these options:

  • Adding a title and putting the guide at the bottom
life_df |> filter(date == 2000) |>
  ggplot(aes(x = gdp_capita, y = life_exp, color = life_exp, size = pop)) +
  geom_point() +
  scale_x_continuous(transform = "log10") +
  scale_color_viridis_c(option = "magma") +
  guides(color = guide_colorbar(title = "Life expectancy at birth", position = "bottom"))
Warning: Removed 23 rows containing missing values or values outside the scale range
(`geom_point()`).

Note two things. First, R only moves the colorbar to the bottom: the guide for size stays at the right. Second, R changed the direction: the colorbar is now shown horizontally.

  • Reversing the order in the legend
life_df |> filter(date == 2000) |>
  ggplot(aes(x = gdp_capita, y = life_exp, color = life_exp, size = pop)) +
  geom_point() +
  scale_x_continuous(transform = "log10") +
  scale_color_viridis_c(option = "magma") +
  guides(color = guide_colorbar(reverse = TRUE))
Warning: Removed 23 rows containing missing values or values outside the scale range
(`geom_point()`).

Note that R doesn’t change the colors in the plot, only the order in the guide. Low values for life expectancy are now shown first, higher values last.

  • Changing the direction

life_df |> filter(date == 2000) |>
  ggplot(aes(x = gdp_capita, y = life_exp, color = life_exp, size = pop)) +
  geom_point() +
  scale_x_continuous(transform = "log10") +
  scale_color_viridis_c(option = "magma") +
  guides(color = guide_colorbar(direction = "horizontal"))
Warning: Removed 23 rows containing missing values or values outside the scale range
(`geom_point()`).

Again note that R only changes the orientation of the color guide, not the size guide.
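The order argument was described above but not illustrated. The following is a minimal sketch, assuming the mpg data that ships with {ggplot2} as a stand-in for life_df (the variables are chosen for illustration only). The size legend gets order 1 and the colorbar order 2, so the size legend should be drawn first.

```r
library(ggplot2)

# mpg is a stand-in for life_df; p_order is a hypothetical object name
p_order <- ggplot(mpg, aes(x = displ, y = hwy, color = cty, size = cyl)) +
  geom_point() +
  scale_color_viridis_c(option = "magma") +
  guides(
    size = guide_legend(order = 1),    # size legend first
    color = guide_colorbar(order = 2)  # colorbar second
  )

p_order
```

Swapping the two order values should put the colorbar first again.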

For the other aesthetics, which usually show discrete values, guide_legend() allows you to set the options:

guide_legend(
  title = waiver(),
  theme = NULL,
  position = NULL,
  direction = NULL,
  override.aes = list(),
  nrow = NULL,
  ncol = NULL,
  reverse = FALSE,
  order = 0,
  ...
)

Here, there are 3 relevant arguments: nrow, ncol and override.aes. Let’s return to the previous graph. Population is shown with 5 levels. Using direction, you can change the direction from vertical to horizontal. Using nrow and ncol, you can instead spread the keys over nrow rows and ncol columns. For instance, to show the population legend (aesthetic size) on 3 rows of 2 columns:

life_df |> filter(date == 2000) |>
  ggplot(aes(x = gdp_capita, y = life_exp, color = life_exp, size = pop)) +
  geom_point() +
  scale_x_continuous(transform = "log10") +
  scale_color_viridis_c(option = "magma") +
  guides(size = guide_legend(nrow = 3, ncol = 2))
Warning: Removed 23 rows containing missing values or values outside the scale range
(`geom_point()`).

To illustrate the override.aes argument, we’ll use the diamonds dataset and show the price-carat plot with cut mapped on the color aesthetic. We’ll add settings for the points: their size should be 1 and the alpha value 1/4:

diamonds |> slice_sample(prop = 0.10) |>
ggplot(aes(x = carat, y = price, color = cut)) +
  geom_point(size = 1, alpha = 1/4) +
  scale_color_viridis_d(option = "magma") 

Note that R also uses these size and alpha values in the guide. To change that, you need to override the aesthetics using override.aes = list(). The list should then include the alternative settings to use in the guide, e.g. size = 5 and alpha = 1:

diamonds |> slice_sample(prop = 0.10) |>
ggplot(aes(x = carat, y = price, color = cut)) +
  geom_point(size = 1, alpha = 1/4) +
  scale_color_viridis_d(option = "magma") +
  guides(color = guide_legend(override.aes = list(size = 5, alpha = 1)))

Now, the legend shows points with size 5 and without any transparency.

Note that here too, you can change the position, add a title, … .

11.3 Faceting

Graphs where you map variables on 3 or 4 aesthetics are not always easy to read. For instance, a plot that includes a color for product category, a shape to show “premium”, “medium” or “budget” price ranges and a size to represent sales volumes contains too much information to process. One way to deal with this is to split the plot up into subplots. This is what faceting does. To illustrate, let’s reuse the following plot:

diamonds |> slice_sample(prop = 0.10) |>
  ggplot(aes(x = carat, y = price, color = cut)) + 
  geom_point() +
  theme_minimal()

There are three facet functions: facet_null(), which doesn’t show facets and is the default, facet_wrap() and facet_grid(). facet_grid() is ideal if you have multiple faceting variables and want to show one in the rows and the other in the columns. facet_wrap() is ideal if you have one factor. Before we start, let’s see what facet_wrap() does:

diamonds |> slice_sample(prop = 0.10) |>
  ggplot(aes(x = carat, y = price, color = cut)) + 
  geom_point() +
  theme_minimal() +
  facet_wrap(vars(cut))

Here we have 5 subplots: one for every level of cut. The subplots retain the color. Note also that the x-axis on the first two plots in the top row is removed and that the same holds for the y-axis in the plots in the second and third column. The value for each level of cut is included at the top of each subplot. There are many aspects that you can control using the facet functions. To see which ones, let’s start with facet_wrap(). The arguments include:

facet_wrap(
  facets,
  nrow = NULL,
  ncol = NULL,
  scales = "fixed",
  shrink = TRUE,
  labeller = "label_value",
  as.table = TRUE,
  switch = deprecated(),
  drop = TRUE,
  dir = "h",
  strip.position = "top",
  axes = "margins",
  axis.labels = "all"
)

The first argument defines the faceting groups. Using vars(var1, var2, ...) you can add one or multiple variables. The second and third arguments allow you to specify the number of rows and columns. In the example, facet_wrap() used 2 rows and 3 columns. In the example the scales are fixed: all plots have the same range for both price and carat. The other extreme is "free". Using that option, the scales of each subplot depend on the range of values within that subplot. In between these two extremes, "free_x" or "free_y" allow one of the two scales to depend on the range of values while the other is fixed. The default option makes most sense as it maximizes the comparability among subplots. In case you also plot summary statistics, the shrink argument will, by default, shrink the range of the scales to fit the range of the statistics. If you set this value to FALSE, the scales will cover the range of the raw data. This might be relevant if you use geom_smooth() without a point or line geometry: by default, R will shrink the range of the y-axis to fit the values of the smoothed function. The labeller argument allows you to specify the labels on top of each subplot. To illustrate, let’s create a second factor with two levels using the table variable in diamonds, map this factor to the shape aesthetic and use facet_wrap() with the default value for labeller, "label_value":

dia <- diamonds |> mutate(table_fac = case_when(table < 57 ~ 0, table >= 57 ~ 1))
dia$table_fac <- as.factor(dia$table_fac)


dia |> ggplot(aes(x = carat, y = price, color = cut, shape = table_fac)) +
  geom_point() +
  facet_wrap(
    vars(cut, table_fac), 
    labeller = "label_value")

The plot shows the values for both factors, but doesn’t show the factor names. To see those, you have to look at the legend. Using labeller = "label_both" shows both the name of the factor and the value:

dia |> ggplot(aes(x = carat, y = price, color = cut, shape = table_fac)) +
  geom_point() +
  facet_wrap(
    vars(cut, table_fac), 
    labeller = "label_both")

The as.table argument defines the order of the subplots. By default, the subplots are laid out “as a table”: the highest value is shown in the bottom right part of the plot. Setting this value to FALSE shows the subplots in the order in which you would expect them in a plot: the highest values are shown in the top right corner. drop determines what happens in case there are factor levels that do not appear in the data. By default, these are dropped from the subplots. In other words, if there weren’t any values for e.g. Premium cut diamonds, that subplot would not be included in the facet. Changing this to FALSE will show an empty plot. R fills the “matrix” in horizontal direction: dir = "h". Using "v" will first fill in vertical direction. With the default axes = "margins", R draws the axes at the margins of the plot. In the examples, the y-axis is shown only on the left side and the x-axis only at the bottom of a column. Using "all", every subplot will include both the x- and y-axis. Using "all_x" or "all_y", one of the two axes is added to all plots. If axes is not "margins", the axis.labels argument determines which interior axes get labels. With the default "all", all interior axes are labelled; with "margins", only the exterior axes are. With "all_x" or "all_y", R draws labels only on the interior x- or y-axes.
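The scales, dir, axes and axis.labels arguments described above can be tried out as follows. This is a sketch only: it assumes a fixed random sample of diamonds (the object names dia_sample, p_free and p_axes are made up for illustration) and a {ggplot2} version that has the axes and axis.labels arguments shown in the argument list.

```r
library(ggplot2)

set.seed(1)  # make the random sample reproducible
dia_sample <- diamonds[sample(nrow(diamonds), 2000), ]

p_base <- ggplot(dia_sample, aes(x = carat, y = price)) +
  geom_point()

# Free y-scales, panels filled column by column
p_free <- p_base +
  facet_wrap(vars(cut), ncol = 2, scales = "free_y", dir = "v")

# Fixed scales, axes on every panel, labels only on the outer axes
p_axes <- p_base +
  facet_wrap(vars(cut), axes = "all", axis.labels = "margins")

p_free
p_axes
```

Comparing the two plots shows the trade-off: free scales use the space in each panel better, while fixed scales keep the panels directly comparable.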

To illustrate facet_grid, let’s use the diamonds dataset with the second factor for table:

dia |> ggplot(aes(x = carat, y = price, color = cut, shape = table_fac)) +
  geom_point() +
  facet_grid(
    vars(cut, table_fac))

Notice how R, by default, stacks all subplots in vertical direction and fixes all axes, both horizontal and vertical. The subplots are ordered by both factors: starting with the lowest level of cut and, for each level of cut, the values of the second factor, table_fac. The arguments of the facet_grid() function are:

facet_grid(
  rows = NULL,
  cols = NULL,
  scales = "fixed",
  space = "fixed",
  shrink = TRUE,
  labeller = "label_value",
  as.table = TRUE,
  switch = NULL,
  drop = TRUE,
  margins = FALSE,
  axes = "margins",
  axis.labels = "all",
  facets = deprecated()
)

Most are familiar from facet_wrap() with one important exception: there is no facets argument (it can be used but it is deprecated) and nrow and ncol have been replaced with rows and cols. The reason is that the rows and cols arguments are meant to hold the vars() specification. R will then show the facets in a matrix format, with cols showing the values of one factor in the columns and rows showing the values of the other factor in the rows. Using the two factor example:

dia |> ggplot(aes(x = carat, y = price, color = cut, shape = table_fac)) +
  geom_point() +
  facet_grid(
    rows = vars(table_fac), 
    cols = vars(cut))

R now shows the matrix format for the facet. By default all subplots have the same size: space = "fixed". Using "free", both the height and width of the subpanels can differ. Using "free_x" or "free_y" allows a different width or height, but not both. Note that this setting only has an effect if the corresponding scales are free as well. All other arguments are similar to those in facet_wrap().
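To see what the space argument does, it helps to use a discrete axis where the panels contain a different number of categories. The following sketch assumes the mpg data from {ggplot2} rather than diamonds, and p_space is a hypothetical object name.

```r
library(ggplot2)

p_space <- ggplot(mpg, aes(x = cty, y = model)) +
  geom_point() +
  facet_grid(
    rows = vars(drv),
    scales = "free_y",  # each row of panels only lists its own models
    space = "free_y"    # panel height follows the length of the y-scale
  )

p_space
```

With space = "fixed" instead, the three panels would have the same height and the panel with few models would have widely spaced rows.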

Note that you don’t need to include the faceting variable in the aesthetics. You can also use the variable only in the facet function. Doing so will lose the aesthetic used to show the factor. For instance, if we remove color = cut and include cut in facet_wrap():

diamonds |> slice_sample(prop = 0.10) |>
  ggplot(aes(x = carat, y = price)) + 
  geom_point() +
  theme_minimal() +
  facet_wrap(vars(cut))

R shows the facets but, as there is no mapping on the color aesthetic, does so in the default color.

Another useful option is to show all the data in each panel, but to do so in a lighter color. To do so, you need to create a dataset without the factor. Using that dataset, you create a geometry which only includes the x- and y-aesthetics. In the second geometry, you use the original dataset to add the color aesthetic, which you also use to facet. With the diamonds dataset, let’s first create a data frame without cut:

diamonds2 <- diamonds |> select(-cut)

Now we use that dataframe in the first layer:

ggplot() +
  geom_point(data = diamonds2, aes(x = carat, y = price), color = "lightgrey") +
  theme_minimal()

We now add the second layer and use facet_wrap():

ggplot() +
  geom_point(data = diamonds2, aes(x = carat, y = price), color = "lightgrey") +
  geom_point(data = diamonds, aes(x = carat, y = price, color = cut)) +
  theme_minimal() +
  facet_wrap(vars(cut)) 

Now every subplot shows all observations. Each subplot uses the color aesthetic to show the observations that match the value of the factor. Here, you can nicely see if there are any major differences between the full dataset on the one hand and the data in the subplots on the other.

11.4 Coordinates

By default R uses the Cartesian coordinate system shown in fig-cartesian. We covered most of the aspects of coordinate systems in the previous parts, so here, we will recap the most important parts. The coord_cartesian() function includes the following arguments:

coord_cartesian(
  xlim = NULL,
  ylim = NULL,
  expand = TRUE,
  default = FALSE,
  clip = "on")

The arguments xlim, ylim and expand were covered. However, it is useful to stress that xlim and ylim, in contrast to the limits argument in scale_x/y_continuous(), do not remove values from the dataset. This matters when you include statistics in your plot, e.g. a smoothed line. In other words, it is often useful to set the limits in this layer rather than in the scales layer. By default, R will show a message if the coordinate system is replaced. Setting default = TRUE changes this. clip = "on" means that the drawing is clipped to the size of the panel. This is the most useful option. If this is set to "off", R can draw points and lines anywhere on the plot, even outside of the panel. In other words, here you could have points in the margins.
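The difference between limits on the scale and limits on the coordinate system can be checked with a small experiment. The sketch below assumes the built-in mtcars data and a linear smoother; the object names are made up for illustration. It compares the x-range over which the smoothed line was computed in both cases.

```r
library(ggplot2)

p_base <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Limits on the scale: cars outside [2, 4] are removed before the fit
p_scale <- p_base + scale_x_continuous(limits = c(2, 4))

# Limits on the coordinate system: the fit uses all cars, the plot only zooms in
p_coord <- p_base + coord_cartesian(xlim = c(2, 4))

# x-range of the computed smooth (layer 2) in both plots
sm_scale <- suppressWarnings(layer_data(p_scale, 2))
sm_coord <- layer_data(p_coord, 2)
range(sm_scale$x)
range(sm_coord$x)
```

In the first plot the smooth only covers the cars that fall within the limits, while in the second it still covers the full range of wt even though only part of it is shown.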

To fix the ratio between the y and x axis, you can use coord_fixed(). By default, this ratio is 1: a one unit change on the horizontal axis takes up the same length as a one unit change on the vertical axis. You can change this to e.g. 2. The ratio is expressed as y / x, so in that case one unit on the vertical axis takes up the same length as two units on the horizontal axis.

coord_fixed(
  ratio = 1, 
  xlim = NULL, 
  ylim = NULL, 
  expand = TRUE, 
  clip = "on")
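As a small illustration of the ratio argument, here is a sketch that assumes the built-in mtcars data; p_fixed is a hypothetical object name. Because the ratio is expressed as y / x, a ratio of 1/5 draws five units of mpg as long as one unit of wt.

```r
library(ggplot2)

p_fixed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  coord_fixed(ratio = 1 / 5)  # 5 units on the y-axis are as long as 1 unit on the x-axis

p_fixed
```

Resizing the plot window should leave this ratio unchanged: the panel keeps its shape and the surrounding white space absorbs the difference.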

Polar coordinates are used to e.g. draw pie charts. The arguments of the coord_polar() function are:

coord_polar(
  theta = "x", 
  start = 0, 
  direction = 1, 
  clip = "on")

Here theta refers to the variable that will be used for the angle in the polar charts, start determines the start position and direction sets the direction around the polar coordinate, with 1 equal to clockwise and -1 equal to counterclockwise. To illustrate the code for a pie chart, we will use a sample of the diamonds dataset. The first part sets up the “normal” bar chart. Here, we include a variable factor(1) for the x-axis and use the aesthetic fill to map the clarity. Note that the bar chart shows the number of observations on the vertical axis. We also add a scales layer and, using labs(), the title.

dia10 <- diamonds |> slice_sample(prop = 0.10)

dia10 |>
ggplot(aes(x = factor(1), fill = clarity)) +
  geom_bar(width = 1)  +
  scale_fill_manual(values = c("#DFFF00", "#FFBF00", "#FF7F50", "#9FE2BF", "#40E0D0", "#6495ED","#CCCCFF","#E8CCD4" )) +
  labs(
    title = "Number of observations for each clarity group",
    x = "",
    y = "") +
  theme_minimal()

We then use polar coordinates to change the bar charts into a pie chart. To remove the axis, we use theme_void():

dia10 |>
  ggplot(aes(x = factor(1), fill = clarity)) +
  geom_bar(width = 1)  +
  scale_fill_manual(values = c("#DFFF00", "#FFBF00", "#FF7F50", "#9FE2BF", "#40E0D0", "#6495ED","#CCCCFF","#E8CCD4" )) +
  labs(
    title = "Number of observations for each clarity group",
    x = "",
    y = "") +
  coord_polar(theta = "y", start = 0, direction = 1) +
  theme_void() 

If you are interested and would like to see how polar coordinates work, check the “Polar coordinates” box.

To start, let’s look at two points in a Cartesian coordinate system (Figure 11.6).

Figure 11.6: Points in a cartesian coordinate system

In Figure 11.6, you can see that there are two ways to represent a point. The first, using the Cartesian coordinates, refers to the x and y values. The first, blue point is given by the pair (750, 400) and the second, green point is given by the pair (500, 866). Using these pairs, we can identify every point in the Cartesian coordinate system. But you can see that there is also a second way to do so. Starting at the origin, every point can be represented by an angle and a radius. For the first point, the angle is 28.072° and the radius is 850; the second point’s angle is 60° and the radius is 1000. Note that all points on the red circle have a radius of 1000 (if one point on the circle has a radius of 1000, so must all others). The fact that you can represent a point using the angle and the radius is important for polar coordinates. Note that here we measure the angle counterclockwise.
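The angle and radius quoted above can be recomputed from the x and y values. The following base R sketch uses the two points from Figure 11.6; the variable names are chosen for illustration.

```r
# The two points from Figure 11.6
x <- c(750, 500)
y <- c(400, 866)

radius <- sqrt(x^2 + y^2)         # distance from the origin
angle  <- atan2(y, x) * 180 / pi  # counterclockwise angle in degrees

data.frame(x, y, radius, angle)
```

The second point comes out marginally below a radius of 1000 and an angle of 60° because 866 is itself a rounded value.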

Recall that a circle is 360°. The part of the y-axis above the horizontal axis is at 90°, the part of the x-axis on the left of the vertical axis at 180°, the part of the y-axis below the x-axis at 270° and the part of the x-axis on the right of the vertical axis is at 0° or 360°. For angles larger than 360°, you move around the circle more than once. For instance, 450° equals one time around the circle (360°) + 90° and will be shown on the part of the y-axis above the horizontal axis. In other words, all dots with the same radius but with angles equal to 360°, 720°, … are on the same spot on the same part of the x-axis, and all dots with the same radius but angles 45°, 405°, 765° are also on the same spot.

Let’s now consider the straight line with Cartesian coordinates in Figure 11.7. The x-axis starts at 0 and ends at 1000. The range of the y-axis is 50 to 1050. In other words, the equation for the straight line is

\[ y = 50 + x \]

The major breaks are set every 180 on the horizontal axis and every 250 on the vertical axis. The major breaks on the horizontal axis are shown in blue, those on the vertical axis are shown in red.

Figure 11.7: A line in a cartesian coordinate system

A circle is 360°, half a circle is 180°, … . In other words, we can rescale the horizontal axis to “times 360°” (Figure 11.8). Doing so, we can look at these points as “angles”: x = 180 is half a circle and represents an angle of 180°. Likewise, x = 360 is a full circle and represents an angle of 360° or 0°. If x is larger than 360, e.g. 540, this represents 1.5 times 360°. In other words, it equals a full movement around the circle plus half a movement, so this point will be shown at a 180° angle. 720 is two times 360, or two movements around the circle. In other words, the angle is 0°. Figure 11.8 also re-interprets the y-axis: this now shows the radius.

Figure 11.8: A line in a cartesian coordinate system with degrees on the horizontal axis and radius on the vertical axis

Doing so, we can reinterpret every point on the line in terms of an angle, measured on the horizontal axis, and a radius, measured on the vertical axis. The circles show points where the angle is 180° or 0°. Given the equation for the straight line: if x = 180 and y = 230, we would interpret that purple point as a point with an angle of 180° and a radius of 230; the blue point (x = 360, y = 410) is a point with angle 0° and radius 410; (540, 590) is a point with angle 180° (one time around the circle + 180°) and radius 590, … . Figure 11.9 shows these points. Here you can see that a point such as 180° with radius 230 is shown on the horizontal axis, at a distance of 230 from the origin. After a full circle, a point such as (360, 410) is again shown on the horizontal axis, … .

Figure 11.9: Points with angle 180° or 360°

If you wanted to connect these dots in the order in which they appear on the line in Figure 11.8, you would have to draw expanding circles.

The triangles in Figure 11.8 show points with an angle equal to 60°. On the x-axis, these points are 60, 420 and 780 (60°, once around the circle + 60°, twice around the circle + 60°). The diamonds show points with an angle of 28.072° (28.072, 388.072, …). Figure 11.10 shows these points.
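Reading x-values as angles amounts to taking the remainder after division by 360. A short base R sketch, using the line y = 50 + x from above and variable names chosen for illustration:

```r
# x-values on the line y = 50 + x, read as degrees
x <- c(180, 360, 540, 720)

angle  <- x %% 360   # position on the circle, in degrees
turns  <- x %/% 360  # completed movements around the circle
radius <- 50 + x     # the y-value of the line

data.frame(x, turns, angle, radius)
```

The angle alternates between 180° and 0° while the radius keeps growing, which is why connecting the points in order gives expanding circles.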

Figure 11.10: Points with angle 60° or 28.072°

Notice how all these points lie on the same straight line out of the origin: as all points with x-values of 60, 420, … in Figure 11.8 represent points with an angle of 60°, they are all on that line. The only thing that makes them different is their radius. The same holds for the other line: all points with x-values of 28.072, 388.072, … in Figure 11.8 are on a line with angle 28.072°. They differ as they have a different radius. Note that here too, connecting all points in the order in which they appear on the line in Figure 11.8 would require you to draw expanding circles. Figure 11.11 shows all points in both Figure 11.9 and Figure 11.10.

Figure 11.11: Points with angle radius

Let’s now see what polar coordinates do. This is shown in Figure 11.12. Recall that the major break lines on the y-axis were shown in red in Figure 11.8. In Figure 11.12, you can see these lines as circles around the origin. The outer circle shows the “angle” using the full range of x-values. In other words, starting from 0 and moving counterclockwise, every dot is now shown by an angle starting in the origin towards the outer circle, where the angle is shown, and ending at the radius for that point.

Figure 11.12: Polar coordinates

Figure 11.12 is the same as Figure 11.7 but shows the angle and radius on a circle rather than in a Cartesian way.

Let’s now use diamonds to construct a pie chart that shows the relative share of every level of clarity for diamonds with cut = “Very Good”:

diamonds |> filter(cut == "Very Good") |>
  ggplot(aes(x = cut, fill = clarity)) +
  geom_bar() +
  theme_minimal()

Using polar coordinates, the pie chart is

diamonds |> filter(cut == "Very Good") |>
  ggplot(aes(x = cut, fill = clarity)) +
  geom_bar() +
  coord_polar(theta = "y") +
  theme_minimal()

How do you get to this plot? What we actually want to show is the relative share of each level of clarity in the “Very Good” cut category:

diamonds |> filter(cut == "Very Good") |>
  group_by(clarity) |>
  summarize(n = n()) |>
  mutate(perc_n = n/sum(n)) |>
  mutate(n_360 = perc_n * 360)
# A tibble: 8 × 4
  clarity     n  perc_n n_360
  <ord>   <int>   <dbl> <dbl>
1 I1         84 0.00695  2.50
2 SI2      2100 0.174   62.6 
3 SI1      3240 0.268   96.5 
4 VS2      2591 0.214   77.2 
5 VS1      1775 0.147   52.9 
6 VVS2     1235 0.102   36.8 
7 VVS1      789 0.0653  23.5 
8 IF        268 0.0222   7.99

Here, we have the percentage share of every level of clarity in the column perc_n. That percentage is the “share” of the pie in a pie chart. How does R get to this percentage? The intuition is this. Let’s show this summary table in a column geometry and add the column n_360 as a label. This column shows the share of the pie that a level of clarity would take, with the “angle” defined as 360 times the percentage share. In other words, the category “I1” would start from a zero angle and continue to an angle of about 2.5°. The second category would start at about 2.5° and end at about 65.1°. The third would start at about 65.1° and end at about 161.6°, … .

diamonds |> filter(cut == "Very Good") |>
  group_by(clarity) |>
  summarize(n = n()) |>
  mutate(perc_n = n/sum(n)) |>
  mutate(n_360 = perc_n * 360) |>
  ggplot(aes(x = clarity, fill = clarity)) +
  geom_col(aes(y = n_360)) +
  geom_text(aes(y = n_360, label = round(n_360, 2)), nudge_y = 2)

Because we have only one value for “cut”, the radius is always 1. In other words, R will show every category of clarity with an angle defined by n_360 and a radius of 1. To show this pie, we need to tell R where the angle is. In this case, it is on the y-axis of the bar chart: the number of observations per group. To do so, we add theta = "y". This tells R that the angle is on the vertical axis and the radius on the horizontal axis (which is 1):

diamonds |> filter(cut == "Very Good") |>
  ggplot(aes(x = cut, fill = clarity)) +
  geom_bar() +
  coord_polar(theta = "y") 

The pie chart shows the number of observations along the outer circle (the angle). The angle for the first category is very small: 84 observations, or about 2.5°. Adding these observations along the outer circle, R starts a new category at every new “angle”: about 2.5°, 65.1° (2.5 + 62.6), 161.6° (2.5 + 62.6 + 96.5), … . Adding all these angles, the circle will be 360° and the slices show the relative importance of every category.
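The start and end angle of every slice can be computed with a cumulative sum. This is a sketch of the arithmetic only, not of what {ggplot2} does internally; the column names start and end and the object name slices are made up for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data

slices <- diamonds |>
  filter(cut == "Very Good") |>
  count(clarity) |>
  mutate(
    n_360 = n / sum(n) * 360,  # width of the slice in degrees
    end   = cumsum(n_360),     # angle at which the slice ends
    start = end - n_360        # angle at which the slice starts
  )

slices
```

The last value in the end column is 360: the slices together fill the full circle.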

11.5 Themes: the non-data parts of your graph.

11.5.1 Themes

The examples almost always include a layer theme_minimal(). Doing so, the panel of the plot was white, the axes grey, … . There are various themes built into {ggplot2}: theme_minimal() is just one example. [Here] (https://ggplot2.tidyverse.org/reference/ggtheme.html) or in the [complete theme section] (https://ggplot2-book.org/themes#sec-themes) of Chang (2025) you can find examples of plots using these themes. In addition, {[ggthemes] (https://jrnold.github.io/ggthemes/index.html)} (Arnold (2024)) includes about 20 themes, in addition to color, shape and line type scales.

By default, R uses a standard theme. You can get that standard theme using theme_get(). Here the code is not evaluated as R returns a nested list of 136. However if you run that code in the console you’ll see all list components. Themes are, in other words, lists.

theme_get()

You can set the theme using the theme_set() function. R will then use this theme for all subsequent plots, unless you add another theme to a plot.

theme_set(theme_light())

Using an existing theme, you can add modifications to specific components using the theme_update() function. To do so, it is a good idea to first make a copy of the theme you want to change. You can assign the “old” theme to a new theme name and use theme_set(new_name_theme). All updates will now happen with the copy of the old theme, so you can always reuse that old theme. Using theme_update() you can now modify specific parts of the theme. All other theme elements are unaffected. In other words, this function is ideal if you want to add a couple of modifications to some theme elements, but are satisfied with the other ones. To use this function, you list the changes to the elements as arguments. For instance, suppose that you want to change the background color of the panel to white, add margins of 1 cm on each side between the panels and the background, set the font size and families for the title, labels and captions, change the major grid lines, remove minor grid lines, … but you are satisfied with all other elements of the “old” theme. You can run:

new_name_theme <- theme_update(
    plot.background = element_rect(fill = "white", color = "white"),
    plot.margin = margin(1, 1, 1, 1, "cm"), 
    plot.title = element_text(size = 12, family = "serif", face = "bold", color = "darkgrey"), 
    plot.title.position = "panel",
    plot.subtitle = element_text(size = 10, family = "sans", face = "italic", color = "grey"), 
    plot.caption = element_text(size = 8, face = "bold.italic", color = "lightgrey"),
    panel.background = element_rect(fill = "white"), 
    panel.grid.major.x = element_line(color = "#F2F1F1", linetype = "solid", linewidth = 0.5),
    panel.grid.major.y = element_line(color = "#F2F1F1", linetype = "solid", linewidth = 0.5),
    panel.grid.minor.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    axis.line.x = element_line(color = "#D6D5D5", linewidth = 1),
    axis.line.y = element_line(color = "#D6D5D5", linewidth = 1),
    axis.title.x = element_text(color = "darkgrey", size = 10, family = "serif", face = "italic"), 
    axis.title.y = element_text(color = "darkgrey", size = 10, family = "serif", face = "italic"), 
    axis.text.x = element_text(color = "lightgrey", size = 10, family = "serif", face = "plain"), 
    axis.text.y = element_text(color = "lightgrey", size = 10, family = "serif", face = "plain"), 
    legend.background = element_rect(fill = "white"), 
    legend.title = element_text(size = 10, face = "bold", color = "#CCCCFF"), 
    legend.key = element_rect(fill = "white"),
    legend.byrow = FALSE, 
    legend.position = "right")

Doing so, you update the current theme with your personal style, and all subsequent plots will use these values. Note that theme_update() returns the theme as it was before the update, so new_name_theme holds the previous theme; you can restore it with theme_set(new_name_theme).

However, this function requires that you know which elements you can actually change. This is what we’ll show next. Note that you don’t have to know all of these. It is actually sufficient to know that you can change a lot and how you can do that in case you need to. Using theme_update() you start from an existing theme and update some of the values. To design a theme from scratch, you can use
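A compact way to try this out without losing the default theme is to store what theme_set() and theme_update() return. In the sketch below, old_theme and before_update are hypothetical names and mtcars is used as example data.

```r
library(ggplot2)

old_theme <- theme_set(theme_light())  # returns the theme that was active before
before_update <- theme_update(         # also returns the previous theme
  panel.grid.minor = element_blank()
)

# Plots drawn from here on use theme_light() without minor grid lines
p_updated <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()
p_updated

theme_set(old_theme)  # restore the original default theme
```

Because the last line restores the stored theme, the changes only apply to plots drawn between the two theme_set() calls.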

11.5.2 Theme elements

Themes change the elements of the plot using 4 functions:

  • element_text() allows you to set the font (family, face, color, size) and the adjustment (hjust, vjust, angle). There are three font families that work everywhere: “sans” (the default), “serif” and “mono”. For other families, the font in one system (e.g. Windows) will not always show in the same way on another system (e.g. Mac). The face is “plain”, “bold”, “italic” or “bold.italic”. The font size is measured in points (1 pt = 0.35 mm). We already covered the justification, hjust and vjust: for horizontal alignment (hjust) you have 0 for left, 0.5 for center and 1 for right, and for vertical alignment (vjust) you have 1 for top, 0.5 for middle and 0 for bottom. The angle is measured in degrees.

  • element_line() allows you to set parameters for all line elements in the themes. Here, you can set color, and line width and line type.

  • element_rect() is used to set parameters for all surfaces and their borders, e.g. the panel background, plot background, … . You can set the fill and border colors (fill and color), line width and line type.

  • element_blank() draws nothing, e.g. to remove labels from the axis.

To set all text elements, you can use theme(text = element_text()). Doing so allows you to specify e.g. the font family, the size of the basic font, … . All subsequent uses of element_text() will inherit these values, so you only need to specify the changes relative to these default values. In some cases, you can even use rel(1.5) to set the size of a specific text component equal to 1.5 times the default size. The same holds for line, rect and title (all titles including axis, …).
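The inheritance described above can be sketched as follows. The plot, its labels and the object name p_text are made up for illustration; only the theme() call matters.

```r
library(ggplot2)

p_text <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Fuel use and weight", x = "Weight", y = "Miles per gallon") +
  theme(
    text = element_text(family = "serif", size = 11),           # base for all text
    plot.title = element_text(size = rel(1.5), face = "bold"),  # 1.5 times the inherited size
    axis.title = element_text(face = "italic"),                 # both axis titles
    axis.title.y = element_text(color = "darkgrey")             # only the y-axis title
  )

p_text
```

The y-axis title ends up serif, italic and dark grey: it inherits the family from text and the face from axis.title, and only the color is set on axis.title.y itself.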

Let’s look at the various components of the non-data parts. First, there are the axes. Theme elements form a hierarchy: there are general elements such as axis.title and axis.ticks, but also more specific ones such as axis.title.x, axis.title.y, axis.title.x.top and axis.title.x.bottom, or axis.ticks.x and axis.ticks.y, … If you specify axis.title or axis.ticks, you do so for all axis titles or ticks. If you need a different style for one axis, you have to use the specific elements, e.g. axis.title.x and axis.title.y.

A plot includes a panel, axes with their labels, a title, a subtitle, a caption, a tag and the guides. These are plotted against a background. The plot.background refers to the entire area of the plot. Using element_rect() you can change the fill color and the border (line color, width and type). Within the plot background, you find the other components. Using plot.margin you can set the top, right, bottom and left margins (starting at the top and moving clockwise) in units. These two elements set the background, border and margins:

plot.background  
plot.margin

You have now defined the background of the plot: the background color, a border and the area within which the plot will be shown.
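A possible way to set both elements is sketched below. The mpg data, the colors and the margin sizes are assumptions made for the example; margin() takes its arguments in the order top, right, bottom, left.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  theme(
    # fill and border of the entire plot area
    plot.background = element_rect(fill = "grey95", colour = "black",
                                   linewidth = 1),
    # margins around the plot: top, right, bottom, left
    plot.margin = margin(t = 1, r = 2, b = 1, l = 2, unit = "cm")
  )
p
```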

The title, subtitle and caption are styled using element_text(). The position of the title (which also applies to the subtitle) and of the caption takes one of two values: “panel” (default) and “plot”. The former aligns them with the panel, the latter with the entire plot. For the tag (e.g. Fig. 1), the location is “panel”, “plot” or “margin” (in the panel area, in the plot area or in the margins) and the position is “topleft”, “top”, “topright”, “left”, “right”, … . You can set the tag text using the labs() function.

plot.title
plot.title.position
plot.subtitle
plot.caption
plot.caption.position
plot.tag
plot.tag.position
plot.tag.location
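The sketch below combines several of these elements. The label texts and the mpg data are made up for illustration; plot.tag.location is left at its default here.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  labs(
    title    = "Engine size and fuel economy",
    subtitle = "Highway miles per gallon",
    caption  = "Data: ggplot2::mpg",
    tag      = "Fig. 1"
  ) +
  theme(
    plot.title    = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(colour = "grey30"),
    plot.caption  = element_text(face = "italic", hjust = 0),
    # align title, subtitle and caption with the whole plot, not the panel
    plot.title.position   = "plot",
    plot.caption.position = "plot",
    plot.tag          = element_text(face = "bold"),
    plot.tag.position = "topleft"
  )
p
```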

The next component is the panel area, which includes the panel as well as the axes, axis labels, … . For the panel, you can set the background color and border as well as the major and minor grid lines. For the background and border you need element_rect(); for the grid lines you need element_line(). You can set the following elements of the panel:

panel.background
panel.border
panel.grid
panel.grid.major
panel.grid.minor
panel.grid.major.x
panel.grid.major.y
panel.grid.minor.x
panel.grid.minor.y
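As an illustration, the following sketch uses a white panel with a border and keeps only the horizontal major grid lines. The colors are arbitrary. Note that panel.border is drawn on top of the data, so its fill should be NA.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  theme(
    panel.background   = element_rect(fill = "white"),
    # the border is drawn over the panel, so keep it unfilled
    panel.border       = element_rect(fill = NA, colour = "grey40"),
    # style all major grid lines ...
    panel.grid.major   = element_line(colour = "grey85"),
    # ... then remove the vertical ones and all minor grid lines
    panel.grid.major.x = element_blank(),
    panel.grid.minor   = element_blank()
  )
p
```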

With multiple panels (facets) you can also set the distance between the panels using unit():

panel.spacing
panel.spacing.x
panel.spacing.y
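For a faceted plot, the spacing could be set as in the sketch below. The facet variables from mpg and the distances are chosen only for illustration.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl) +
  theme(
    # horizontal distance between panels
    panel.spacing.x = unit(0.5, "cm"),
    # vertical distance between panels
    panel.spacing.y = unit(1, "lines")
  )
p
```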

For the axis titles and the axis text (the tick labels), you use element_text() to set the font family, face, size and alignment. You can set these for the following axis elements:

axis.title
axis.title.x
axis.title.x.top
axis.title.x.bottom
axis.title.y
axis.title.y.left
axis.title.y.right
axis.text
axis.text.x
axis.text.x.top
axis.text.x.bottom
axis.text.y
axis.text.y.left
axis.text.y.right
axis.text.theta
axis.text.r
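The sketch below sets a general style for all axis titles and tick labels and then overrides it for a single axis. The specific angles and colors are illustrative assumptions.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot() +
  theme(
    # applies to the titles of both axes
    axis.title   = element_text(face = "bold"),
    # override for the y-axis title only: horizontal, vertically centred
    axis.title.y = element_text(angle = 0, vjust = 0.5),
    # applies to the tick labels of both axes
    axis.text    = element_text(colour = "grey20"),
    # override for the x-axis tick labels only: rotated and right-aligned
    axis.text.x  = element_text(angle = 45, hjust = 1)
  )
p
```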

To adjust the axis lines, you use element_line() to adjust line width, line type, … . If you use element_blank() no lines are shown. The axis lines include:

axis.line
axis.line.x
axis.line.x.top
axis.line.x.bottom
axis.line.y
axis.line.y.left
axis.line.y.right
axis.line.theta
axis.line.r
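A short sketch, assuming the default theme (which draws no axis lines): add a line for all axes and then remove it again for the y-axis.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  theme(
    # draw a line for every axis
    axis.line   = element_line(colour = "black", linewidth = 0.5),
    # but show no line on the y-axis
    axis.line.y = element_blank()
  )
p
```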

You can change the color, … of the ticks and minor ticks using element_line(). These tick elements are:

axis.ticks
axis.ticks.x
axis.ticks.x.top
axis.ticks.x.bottom
axis.ticks.y
axis.ticks.y.left
axis.ticks.y.right
axis.ticks.theta
axis.ticks.r
axis.minor.ticks.x.top
axis.minor.ticks.x.bottom
axis.minor.ticks.y.left
axis.minor.ticks.y.right
axis.minor.ticks.theta
axis.minor.ticks.r

You can set tick mark lengths using unit(), e.g. unit(x, "cm") or unit(y, "mm"); the defaults are expressed in points:

axis.ticks.length
axis.ticks.length.x
axis.ticks.length.x.top
axis.ticks.length.x.bottom
axis.ticks.length.y
axis.ticks.length.y.left
axis.ticks.length.y.right
axis.ticks.length.theta
axis.ticks.length.r
axis.minor.ticks.length
axis.minor.ticks.length.x
axis.minor.ticks.length.x.top
axis.minor.ticks.length.x.bottom
axis.minor.ticks.length.y
axis.minor.ticks.length.y.left
axis.minor.ticks.length.y.right
axis.minor.ticks.length.theta
axis.minor.ticks.length.r
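Tick style and tick length can be combined as in this sketch. The values are arbitrary; minor ticks are left out because they also have to be switched on in the axis guide.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  theme(
    # style for the ticks on all axes
    axis.ticks          = element_line(colour = "grey30", linewidth = 0.4),
    # no ticks on the y-axis
    axis.ticks.y        = element_blank(),
    # longer ticks on the x-axis
    axis.ticks.length.x = unit(0.25, "cm")
  )
p
```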

We can now move to the guides (legends). First, you can set the position of the legend (legend.position) at the “left”, “right”, “bottom” or “top” of the panel, or “inside” the panel. In the latter case, you need to specify a vector with the x,y position of the legend in legend.position.inside. The orientation of the legend, “vertical” or “horizontal”, is set using legend.direction. With legend.byrow you choose whether the keys within a legend are filled by row (TRUE) or by column (FALSE, the default). With multiple legends, legend.box arranges them next to each other (“horizontal”) or below each other (“vertical”). You can set the spacing between legends and the margins around the legends.

Using element_text() you can change the legend titles (legend.title) and the legend item labels (legend.text). You can also change the position of the title (legend.title.position) relative to the main legend and the position of the legend text (legend.text.position) relative to the legend keys (“top”, “right”, “bottom”, “left”). You can adjust the background color of the legend as well as the background underneath the legend keys. You can also set several dimensions of the legend keys (size, height, width and spacing).

To position the legend relative to the panel of the plot, you can use the legend justification elements. The legend.box.* elements control the box that surrounds all legends together: its background, its margin and its spacing to the plot.

legend.position
legend.position.inside
legend.direction
legend.byrow
legend.margin
legend.spacing
legend.spacing.x
legend.spacing.y
legend.text
legend.text.position
legend.title
legend.title.position
legend.background
legend.key
legend.key.size
legend.key.height
legend.key.width
legend.key.spacing
legend.key.spacing.x
legend.key.spacing.y
legend.frame
legend.ticks
legend.ticks.length
legend.axis.line
legend.justification
legend.justification.top
legend.justification.bottom
legend.justification.left
legend.justification.right
legend.justification.inside
legend.location
legend.box
legend.box.just
legend.box.margin
legend.box.background
legend.box.spacing
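A minimal legend sketch follows, using only elements that have been available for a long time. The mapping of drv to color and all styling values are assumptions for the example; elements such as legend.position.inside need a recent ggplot2 version and are not used here.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  theme(
    # place the legend below the panel, with keys next to each other
    legend.position   = "bottom",
    legend.direction  = "horizontal",
    # text styling for the legend title and the key labels
    legend.title      = element_text(face = "bold"),
    legend.text       = element_text(size = 9),
    # background of the legend and of the individual keys
    legend.background = element_rect(fill = "grey95", colour = NA),
    legend.key        = element_rect(fill = "white"),
    # size of the legend keys
    legend.key.size   = unit(0.5, "cm")
  )
p
```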

With facets, you have further options to define the layout of these facets, including e.g. the background and the text of the strips that label the facets, separately for the horizontal (x) and vertical (y) strips.

strip.background
strip.background.x
strip.background.y
strip.clip
strip.placement
strip.text
strip.text.x
strip.text.x.bottom
strip.text.x.top
strip.text.y
strip.text.y.left
strip.text.y.right
strip.switch.pad.grid
strip.switch.pad.wrap
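The strip elements can be tried out on a faceted plot, as in the sketch below. The facet variables and colors are illustrative assumptions.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl) +
  theme(
    # dark strips without a border
    strip.background = element_rect(fill = "grey20", colour = NA),
    # white bold labels in all strips
    strip.text       = element_text(colour = "white", face = "bold"),
    # labels of the vertical strips are written horizontally
    strip.text.y     = element_text(angle = 0)
  )
p
```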

11.5.3 Branding your visualization

Many organizations have a logo, a brand name, … . You can add these to your visualization. For a simple illustration of how to add a logo, see e.g. [You Need to Start Branding Your Graphs. Here’s How, with ggplot!] (https://rpubs.com/mclaire19/ggplot2-custom-themes).